Skip to content

Conversation

@ehanson8
Copy link
Contributor

Purpose and background context

Refactor clients, config, and CLI functionality in addition adding a finalize CLI command. The changes have been logically grouped into several commits for easier review.

How can a reviewer manually see the effects of these changes?

Not possible yet

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

* Rename recipient_email_address > recipient_email_addresses and update type hint to list[str] to reflect expected usage
* Update unit tests to account for these changes
* Remove __getattr__ method
* Add workspace, sentry_dsn, aws_region_name, dss_input_queue, and dsc_source_email properties and update calls across repo
* Remove duplicate addHandler call in configure_logger method
Why these changes are being introduced:
* A CLI command is need to process the submissions results that are sent to a DSS output queue

How this addresses that need:
* Rename stream > log_stream
* Add finalize CLI command and corresponding CLI test
* Add process_results and send_logs methods to Workflow class and corresponding unit tests
* Add Config call in Workflow module
* Remove SimpleCSV.process_deposit_results method as the class will default to Workflow.process_results
* Reorder fixtures

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1105
Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All config changes looking great to me, thanks for pushing on that and leading a discussion.

I did leave a couple of comments about the CLI / Workflow relationship, and the approach of capturing logs as the primary output. I can remain open to it, but would be remiss without a bit more discussion either in this PR or a call about the approach.

def process_deposit_results(self) -> list[str]:
"""Process results generated by the deposit according to the workflow subclass.
def process_results(self) -> dict[str, Any]:
"""Process results in the workflow's output queue.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we expand on this docstring a bit? Maybe an opportunity to update the run() method as well. I think it could be helpful if the docstring touches on two things:

  1. what this method acheives for the Workflow, very high level, for humans
  2. a tiny bit of how it accomplishes this technically (e.g. mentioning SQS queues)

For this PR, I'm looking at the base Workflow class from kind of a 30ft view. My understanding is that:

  • CLI command deposit --> Workflow method run()
  • CLI command finalize --> Workflow method process_results()

I'm not strictly opposed to this naming mismatch, but I think for someone who is not actively in the codebase (e.g. myself) I wonder why this naming disconnect. I had interpreted run() to be the primary method for the Workflow in previous PRs, but I think this seemingly equally important method process_results() pushes on that a bit.

I'm not sure renaming is required (but open to it), but at the very least I think the docstring could hold your hand a little more when you land here from the CLI commands.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this could use more detail, I'll add that. @jonavellecuerdo and I were also discussing possibly updating run > submit_items and the deposit CLI command to submit to be more accurate. deposit was another holdover from wiley-deposits that I've been moving away from but hadn't updated the CLI command yet

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to note that at some point @ehanson8 mentioned maybe changing the CLI command to submit, but there was some hesitancy in that DSS also uses submit -> upload submissions to DSpace@MIT.

From my recollection, these CLI commands were taken from Wiley deposits. I'm in support of alignment in names, if anything:

  • CLI command submit -> Workflow.submit
  • CLI command process-results -> Workflow.process_results

Also agree comments re: Expanding info docstrings!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonavellecuerdo Jinx 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: I just finished reading @ghukill 's comment on how the CLI command finalize calls two methods from Workflow and now am not sure about the second bullet point suggesting renaming finalize -> process_results. 🤔

dsc/cli.py Outdated
Comment on lines 137 to 138
workflow.process_results()
workflow.send_logs(ctx.obj["log_stream"])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to locate this comment here, but it certainly has tendrils beyond these two lines. It may have connection to another comment about the CLI / Workflow class naming disconnect as well.

I suppose I'm still feeling a bit uneasy about setting up a StringIO logger, capturing virtually all logs while the Workflow works, and then shipping them off via email as the primary output of this app.

A few concerns come to mind:

  • it feels like it would be easy for someone not super familiar with this approach, to add a logging statement to code, that would start showing up in emails, given that logging is conventionally not directly shipped by an application
  • the order of logging matters: if we ever wanted to reorganize or cleanup the email sent out, we're kind of at the mercy of when and how things are logged

I suppose this feels like it has connection to the CLI/Workflow method naming, as why wouldn't we have a Workflow method like finalize(), that could call methods like process_results() and send_logs() (or maybe that gets reworked into something like report()).

From my POV, it would arguably be easier to capture messages -- e.g. on the Workflow instances -- as things happen, and then prepare a final "report" that gets emailed out. If this report generation were encapsulated in a method, you then have the ability to very easily modify it over time as requirements (invariably) change.

Happy to discuss more. I almost proposed it could wait, maybe be a ticket for returning to later, but it feels kind of fundamental to this last major action the Workflow takes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear your concerns and it needs to be made clear in the eventual README.md how the logs are used, but let's huddle later today to discuss this in more detail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehanson8 @ghukill Is this comment addressed by the change of including a report_data instance attribute on the Workflow class? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe it was and we'll create a report class in the future, there's already a ticket!

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review counts as an initial pass! I have one minor change request in addition to the 1-2 potential change requests proposed by @ghukill (updating Config to handle env vars without defaults and aligning names for CLI commands + Workflow methods).

I will also wait until we have a discussion re: how to capture emailed logs!

* Rename deposit CLI command > submit
* Rename Workflow.run > submit_items
* Rename Workflow.send_logs > report_results and refactor to use report_lines attribute rather than log stream
* Create process_sqs_queue method from refactored process_results functionality but retain process_results method as wrapper method
* Add workflow_specific_processing method and corresponding unit test
* Remove str conversion from SQSClient.process_result_message method
* Remove log stream functionality from CLI
* Remove log stream param from Config.configure_logger and update corresponding unit tests
* Update Config properties to raise exceptions and add corresponding unit tests
@coveralls
Copy link

coveralls commented Jan 21, 2025

Pull Request Test Coverage Report for Build 12910333914

Details

  • 81 of 81 (100.0%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+2.6%) to 99.783%

Totals Coverage Status
Change from base Build 12799339996: 2.6%
Covered Lines: 459
Relevant Lines: 460

💛 - Coveralls

@ehanson8
Copy link
Contributor Author

@ghukill @jonavellecuerdo Pushed a new commit based on further discussions!

Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of conversational comments, but no changes requested. Thanks for facilitating all the discussions @ehanson8. The final product is looking good to me.

This PR introduces a lot, and made some mid-PR pivots, so didn't want to hold it up any longer with some of these questions. But I think as the reporting needs evolve, it might again put some pressure on what the application should do with it's "processing" results (and even, what that means). But at this time, some ambiguity is okay, and the stucture will support that evolution going forward.

Nice work!!

def process_deposit_results(self) -> list[str]:
"""Process results generated by the deposit according to the workflow subclass.
@final
def process_results(self) -> None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I like about having this somewhat umbrella method on the base workflow, is it provides a single place to think deeply about what it should return (if anything). When this logic was partially in the CLI, it was hard to reason about.

It seems safe to say the side effects of this are two-fold:

  1. it retrieves messages from SQS, and if it can parse them, it deletes them
  2. it aggregates messages for a final reporting step

For some reason, when I really squint... it feels like the response of this should tell me something about the state of the output queue? Like, a tuple of the output queue and a count of messages still present?

("dss-output-wiley", 0)

Or, should it reach out and query DSpace and confrim the identifiers + handles it received actually exist? should that be a responsibility of this application?

No changes requested. As mentioned, mostly just commenting that I think the reworking has made discussing these kind of larger questions, easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this app should have any direct interaction with DSpace, that's the responsibility of DSS. But not sure about returning the status of the output queue, the SQSClient.receive method iterates until it's exhausted and if something went haywire in would be caught in the except Exception block which would also go in the report sent to stakeholders.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But again, thinking more here about data flow in the app and less about reporting. Fully agreed that anything wrong when consuming the SQS output queue should be reported on.

But what data does the method return?

My hunch is that it doesn't matter much now, while this tool is manually run and the report is the sole and primary output. But when it gets automated, I have a feeling this will become more important. Like, what can this do + return such that programatically -- downstream in some kind of automation -- we can make decisions about the successful ingestion of items into DSpace.

But yes, don't think anything is needed now!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, a very good point and definitely something to keep in mind during the future automation conversations!

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all your work in this PR! ✨ While I agree that it could be worth revisiting the reporting process later, I think the change to encapsulate report details as part of the Workflow class is a step in the right direction.

@ehanson8 ehanson8 merged commit fb1f96e into main Jan 27, 2025
2 checks passed
@ehanson8 ehanson8 deleted the IN-1105-finalize-command branch January 27, 2025 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants