In 1105 finalize command #80

ehanson8 · 2025-01-16T14:45:31Z

Purpose and background context

Refactor clients, config, and CLI functionality in addition to adding a finalize CLI command. The changes have been logically grouped into several commits for easier review.

How can a reviewer manually see the effects of these changes?

Not possible yet

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/IN-1105

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

* Rename recipient_email_address > recipient_email_addresses and update type hint to list[str] to reflect expected usage
* Update unit tests to account for these changes

* Remove __getattr__ method
* Add workspace, sentry_dsn, aws_region_name, dss_input_queue, and dsc_source_email properties and update calls across repo
* Remove duplicate addHandler call in configure_logger method

Why these changes are being introduced:
* A CLI command is needed to process the submission results that are sent to a DSS output queue

How this addresses that need:
* Rename stream > log_stream
* Add finalize CLI command and corresponding CLI test
* Add process_results and send_logs methods to Workflow class and corresponding unit tests
* Add Config call in Workflow module
* Remove SimpleCSV.process_deposit_results method as the class will default to Workflow.process_results
* Reorder fixtures

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1105

ghukill

All config changes looking great to me, thanks for pushing on that and leading a discussion.

I did leave a couple of comments about the CLI / Workflow relationship, and the approach of capturing logs as the primary output. I can remain open to it, but would be remiss without a bit more discussion either in this PR or a call about the approach.

dsc/config.py

dsc/utilities/aws/ses.py

ghukill · 2025-01-16T14:59:00Z

dsc/workflows/base/__init__.py

-    def process_deposit_results(self) -> list[str]:
-        """Process results generated by the deposit according to the workflow subclass.
+    def process_results(self) -> dict[str, Any]:
+        """Process results in the workflow's output queue.


Can we expand on this docstring a bit? Maybe an opportunity to update the run() method as well. I think it could be helpful if the docstring touches on two things:

what this method achieves for the Workflow, very high level, for humans

a tiny bit of how it accomplishes this technically (e.g. mentioning SQS queues)

For this PR, I'm looking at the base Workflow class from kind of a 30ft view. My understanding is that:

CLI command deposit --> Workflow method run()

CLI command finalize --> Workflow method process_results()

I'm not strictly opposed to this naming mismatch, but for someone who is not actively in the codebase (e.g. myself), the reason for this naming disconnect is not obvious. I had interpreted run() to be the primary method for the Workflow in previous PRs, but I think this seemingly equally important method process_results() pushes on that a bit.

I'm not sure renaming is required (but open to it), but at the very least I think the docstring could hold your hand a little more when you land here from the CLI commands.
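As a rough illustration of the kind of docstring being asked for, here is a minimal sketch. The class is a trimmed-down, hypothetical stand-in for the base Workflow, and the docstring wording is an assumption drawn only from what this PR states (the finalize CLI command calls process_results(), and results arrive on a DSS output queue in SQS); it is not the text that was eventually committed.

```python
class Workflow:
    """Hypothetical, trimmed-down stand-in for the base Workflow class."""

    def process_results(self) -> None:
        """Process submission results from the workflow's output queue.

        This is the method that the `finalize` CLI command calls.

        What it achieves: after items have been submitted, it works through
        the results that DSS reports back for this workflow.

        How it works: result messages are read from the workflow's SQS
        output queue and handled one at a time.
        """


# The docstring is what a reader sees when landing here from the CLI command.
docstring = Workflow.process_results.__doc__
```

The two short paragraphs map onto the two points raised above: one for the human-level purpose and one for the technical mechanism.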

Agree this could use more detail, I'll add that. @jonavellecuerdo and I were also discussing possibly updating run > submit_items and the deposit CLI command to submit to be more accurate. deposit was another holdover from wiley-deposits that I've been moving away from but hadn't updated the CLI command yet

Just wanted to note that at some point @ehanson8 mentioned maybe changing the CLI command to submit, but there was some hesitancy in that DSS also uses submit -> upload submissions to DSpace@MIT.

From my recollection, these CLI commands were taken from Wiley deposits. I'm in support of alignment in names, if anything:

CLI command submit -> Workflow.submit

CLI command process-results -> Workflow.process_results

Also agree with the comments re: expanding the info in docstrings!

@jonavellecuerdo Jinx 🙂

Side note: I just finished reading @ghukill 's comment on how the CLI command finalize calls two methods from Workflow and now am not sure about the second bullet point suggesting renaming finalize -> process_results. 🤔

ghukill · 2025-01-16T15:16:52Z

dsc/cli.py

+    workflow.process_results()
+    workflow.send_logs(ctx.obj["log_stream"])


Going to locate this comment here, but it certainly has tendrils beyond these two lines. It may have connection to another comment about the CLI / Workflow class naming disconnect as well.

I suppose I'm still feeling a bit uneasy about setting up a StringIO logger, capturing virtually all logs while the Workflow works, and then shipping them off via email as the primary output of this app.

A few concerns come to mind:

it feels like it would be easy for someone not super familiar with this approach to add a logging statement to code that would start showing up in emails, given that logging is conventionally not directly shipped by an application

the order of logging matters: if we ever wanted to reorganize or clean up the email sent out, we're kind of at the mercy of when and how things are logged
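The first concern can be illustrated with a minimal sketch that uses only the standard library. The logger name, the function, and the log messages are hypothetical and not taken from this repository; the point is only that a handler writing to a StringIO captures every record at or above its level, including ones added later for unrelated reasons.

```python
import io
import logging

# Hypothetical setup: a logger whose output is captured in memory,
# loosely mirroring the "log_stream" approach discussed in this thread.
log_stream = io.StringIO()
logger = logging.getLogger("dsc_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the sketch's records out of the root logger
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)


def process_results() -> None:
    """Stand-in for workflow processing that logs as it goes."""
    logger.info("Processed item 123")
    # A later contributor adds this line purely for troubleshooting...
    logger.info("Retrying connection, attempt 2")


process_results()

# ...but it ends up in the text that would be emailed to stakeholders.
email_body = log_stream.getvalue()
```

The lines in `email_body` also appear strictly in the order they were logged, which is the second concern: rearranging the email means rearranging the logging calls.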

I suppose this feels like it has connection to the CLI/Workflow method naming, as why wouldn't we have a Workflow method like finalize(), that could call methods like process_results() and send_logs() (or maybe that gets reworked into something like report()).

From my POV, it would arguably be easier to capture messages -- e.g. on the Workflow instances -- as things happen, and then prepare a final "report" that gets emailed out. If this report generation were encapsulated in a method, you then have the ability to very easily modify it over time as requirements (invariably) change.

Happy to discuss more. I almost proposed it could wait, maybe be a ticket for returning to later, but it feels kind of fundamental to this last major action the Workflow takes.
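A minimal sketch of the alternative described above, under stated assumptions: the class below is a simplified, hypothetical stand-in for the base Workflow, the message fields and the handle value are made up, and the report layout is invented for illustration. The names finalize(), process_results(), and report() follow the suggestion in this comment; report_data matches the attribute name adopted later in this PR.

```python
from typing import Any


class Workflow:
    """Hypothetical, simplified stand-in for the base Workflow class."""

    def __init__(self, messages: list[dict[str, Any]]) -> None:
        self.messages = messages  # stand-in for messages from an output queue
        self.report_data: dict[str, list[str]] = {"processed": [], "errors": []}

    def process_results(self) -> None:
        """Record an outcome for each message instead of relying on logs."""
        for message in self.messages:
            if "handle" in message:
                self.report_data["processed"].append(message["item_id"])
            else:
                self.report_data["errors"].append(message["item_id"])

    def report(self) -> str:
        """Build the email body in one place so its layout is easy to change."""
        lines = [
            f"Processed: {len(self.report_data['processed'])}",
            f"Errors: {len(self.report_data['errors'])}",
        ]
        lines.extend(f"  failed item: {item}" for item in self.report_data["errors"])
        return "\n".join(lines)

    def finalize(self) -> str:
        """Umbrella method that a finalize CLI command could call."""
        self.process_results()
        return self.report()


workflow = Workflow(
    [
        {"item_id": "abc", "handle": "1721.1/000"},  # made-up handle
        {"item_id": "def"},  # no handle, treated as an error in this sketch
    ]
)
report = workflow.finalize()
```

Because the report is assembled from structured data rather than from captured log output, adding a logging statement elsewhere in the code would not change what gets emailed.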

I hear your concerns and it needs to be made clear in the eventual README.md how the logs are used, but let's huddle later today to discuss this in more detail

@ehanson8 @ghukill Is this comment addressed by the change of including a report_data instance attribute on the Workflow class? 🤔

Yes, I believe it was and we'll create a report class in the future, there's already a ticket!

jonavellecuerdo

This review counts as an initial pass! I have one minor change request in addition to the 1-2 potential change requests proposed by @ghukill (updating Config to handle env vars without defaults and aligning names for CLI commands + Workflow methods).

I will also wait until we have a discussion re: how to capture emailed logs!

README.md

* Rename deposit CLI command > submit
* Rename Workflow.run > submit_items
* Rename Workflow.send_logs > report_results and refactor to use report_lines attribute rather than log stream
* Create process_sqs_queue method from refactored process_results functionality but retain process_results method as wrapper method
* Add workflow_specific_processing method and corresponding unit test
* Remove str conversion from SQSClient.process_result_message method
* Remove log stream functionality from CLI
* Remove log stream param from Config.configure_logger and update corresponding unit tests
* Update Config properties to raise exceptions and add corresponding unit tests
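To make the last bullet concrete, here is a minimal sketch of a Config property that raises when its environment variable is missing. This is an assumption-laden illustration rather than the code in dsc/config.py: only the DSC_SOURCE_EMAIL variable and the dsc_source_email property name come from this PR, while the exception type, the error message, and the example address are hypothetical.

```python
import os


class Config:
    """Hypothetical sketch of an explicit property replacing __getattr__."""

    @property
    def dsc_source_email(self) -> str:
        value = os.getenv("DSC_SOURCE_EMAIL")
        if not value:
            # Exception type and message are assumptions for illustration.
            raise OSError("Env var 'DSC_SOURCE_EMAIL' must be defined")
        return value


config = Config()

# With the variable unset, accessing the property fails loudly.
os.environ.pop("DSC_SOURCE_EMAIL", None)
try:
    config.dsc_source_email
except OSError as exception:
    error_message = str(exception)

# With the variable set, the property returns its value.
os.environ["DSC_SOURCE_EMAIL"] = "noreply@example.com"
source_email = config.dsc_source_email
```

Raising at access time means a missing variable surfaces at the first use of the property instead of propagating a None value into later calls.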

coveralls · 2025-01-21T20:12:56Z

Pull Request Test Coverage Report for Build 12910333914

Details

81 of 81 (100.0%) changed or added relevant lines in 6 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+2.6%) to 99.783%

Totals
Change from base Build 12799339996:	2.6%
Covered Lines:	459
Relevant Lines:	460

💛 - Coveralls

ehanson8 · 2025-01-21T20:19:02Z

@ghukill @jonavellecuerdo Pushed a new commit based on further discussions!

ghukill

Left a couple of conversational comments, but no changes requested. Thanks for facilitating all the discussions @ehanson8. The final product is looking good to me.

This PR introduces a lot, and made some mid-PR pivots, so didn't want to hold it up any longer with some of these questions. But I think as the reporting needs evolve, it might again put some pressure on what the application should do with its "processing" results (and even, what that means). But at this time, some ambiguity is okay, and the structure will support that evolution going forward.

Nice work!!

dsc/workflows/base/__init__.py

ghukill · 2025-01-21T21:09:47Z

dsc/workflows/base/__init__.py

-    def process_deposit_results(self) -> list[str]:
-        """Process results generated by the deposit according to the workflow subclass.
+    @final
+    def process_results(self) -> None:


What I like about having this somewhat umbrella method on the base workflow, is it provides a single place to think deeply about what it should return (if anything). When this logic was partially in the CLI, it was hard to reason about.

It seems safe to say the side effects of this are two-fold:

it retrieves messages from SQS, and if it can parse them, it deletes them

it aggregates messages for a final reporting step

For some reason, when I really squint... it feels like the response of this should tell me something about the state of the output queue? Like, a tuple of the output queue and a count of messages still present?

("dss-output-wiley", 0)

Or, should it reach out and query DSpace and confirm the identifiers + handles it received actually exist? Should that be a responsibility of this application?

No changes requested. As mentioned, mostly just commenting that I think the reworking has made discussing these kinds of larger questions easier.
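For discussion purposes, the tuple return value floated above could look something like the following sketch. Everything here is hypothetical: FakeQueue is an in-memory stand-in rather than the repo's SQSClient, the message bodies are invented, and the choice to leave unparseable messages in the queue simply mirrors the "if it can parse them, it deletes them" side effect described in this comment.

```python
import json


class FakeQueue:
    """In-memory stand-in for an SQS output queue (not the real SQSClient)."""

    def __init__(self, name: str, bodies: list[str]) -> None:
        self.name = name
        self._messages = list(bodies)

    def receive(self) -> list[str]:
        """Return a snapshot of the messages currently in the queue."""
        return list(self._messages)

    def delete(self, body: str) -> None:
        self._messages.remove(body)

    def __len__(self) -> int:
        return len(self._messages)


def process_results(queue: FakeQueue, report_data: list[dict]) -> tuple[str, int]:
    """Consume the queue and return (queue name, messages still present)."""
    for body in queue.receive():
        try:
            parsed = json.loads(body)
        except json.JSONDecodeError:
            # Unparseable messages are left in the queue in this sketch.
            continue
        report_data.append(parsed)  # aggregate for the reporting step
        queue.delete(body)  # delete only what could be parsed
    return queue.name, len(queue)


report_data: list[dict] = []
queue = FakeQueue("dss-output-wiley", ['{"item": "a"}', "not json"])
status = process_results(queue, report_data)
```

A non-zero count in the returned tuple would give downstream automation a simple signal that something was left unprocessed, separate from whatever goes into the emailed report.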

I don't think this app should have any direct interaction with DSpace; that's the responsibility of DSS. But I'm not sure about returning the status of the output queue: the SQSClient.receive method iterates until it's exhausted, and if something went haywire it would be caught in the except Exception block, which would also go in the report sent to stakeholders.

But again, thinking more here about data flow in the app and less about reporting. Fully agreed that anything wrong when consuming the SQS output queue should be reported on.

But what data does the method return?

My hunch is that it doesn't matter much now, while this tool is manually run and the report is the sole and primary output. But when it gets automated, I have a feeling this will become more important. Like, what can this do + return such that programmatically -- downstream in some kind of automation -- we can make decisions about the successful ingestion of items into DSpace.

But yes, don't think anything is needed now!

Ah, a very good point and definitely something to keep in mind during the future automation conversations!

tests/test_workflow.py

jonavellecuerdo

Thank you for all your work in this PR! ✨ While I agree that it could be worth revisiting the reporting process later, I think the change to encapsulate report details as part of the Workflow class is a step in the right direction.

ehanson8 added 7 commits January 16, 2025 09:32

Add pragma no cover to type checking blocks

2a77585

Update SESClient method arg

05cdcb4

* Rename recipient_email_address > recipient_email_addresses and update type hint to list[str] to reflect expected usage
* Update unit tests to account for these changes

Add DSC DSC_SOURCE_EMAIL env variable

53fcc9c

Update SQSClient var names and logging messages

f27c510

Update Config class

0e695f4

* Remove __getattr__ method
* Add workspace, sentry_dsn, aws_region_name, dss_input_queue, and dsc_source_email properties and update calls across repo
* Remove duplicate addHandler call in configure_logger method

Update Pipfile.lock

f683a85

ehanson8 requested review from ghukill and jonavellecuerdo January 16, 2025 14:45

ghukill reviewed Jan 16, 2025

jonavellecuerdo reviewed Jan 16, 2025

ghukill approved these changes Jan 21, 2025

Update report_lines > report_data

8e05358

jonavellecuerdo approved these changes Jan 24, 2025

ehanson8 merged commit fb1f96e into main Jan 27, 2025
2 checks passed

ehanson8 deleted the IN-1105-finalize-command branch January 27, 2025 14:29
