Skip to content

Conversation

@jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Feb 6, 2025

Purpose and background context

The app requires a framework for creating reports that summarize results from DSC workflow runs. This framework should support creating different reports for the three main DSC CLI commands: reconcile, submit, and finalize. To demonstrate the construction of a report, this PR adds a FinalizeReport, which represents the email that is sent when the finalize CLI command is executed. The FinalizeReport produces an email with the following:

  • Subject heading: "DSpace Submission Results - <workflow_name>, batch='<batch_id>'"
  • File attachments: a CSV file describing successfully deposited items (columns for item_identifier and doi) and a text file for captured error messages (from processing DSS results)
  • Message body: See text in 'finalize' HTML and plain-text templates.

This PR touches several files, so here are some notes for reviewers:

  • One small change in the Makefile is to fix the lint-apply syntax (to run multiple Make commands, it seems they must be on the same line.
  • The new dsc/templates contain 4 very simply formatted HTML and plain-text templates. There is a 'base' template that should be extended by templates for 'child' reports (i.e., report templates for reconcile, submit, and finalize).
  • Recommend reviewing from top-level to low-level (i.e., start with the cli then dig 🤓 ). See order of changes described in commit message.

How can a reviewer manually see the effects of these changes?

Note: I contemplated whether to add unit tests to verify the rendering of the report templates, but I would prefer to review the resulting email once we have the AWS infrastructure in place to send and review test emails. Once that is in place, I plan on including screenshots to this PR/or the ticket for reference. However, this shouldn't block the PR from being merged!

Includes new or updated dependencies?

YES - installs jinja2; installs freezegun as 'dev-package'.

Changes expectations for external applications?

YES - As of this writing, the existing report templates are quite simple,
with the goal of getting to an minimum viable product. It would be good
to include in coming documentation what email recipients can expect
in the emailed report.

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/IN-1157

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* The app requires a framework for creating reports
that summarize results from DSC workflow runs. This framework
should support creating different reports for the three main
DSC CLI commands: reconcile, submit, and finalize.

How this addresses that need:
* Add boolean flag option 'create-and-send-report' to main CLI command
* Apply 'create-and-send-report' to 'finalize' CLI command
* Add report module for creating email data (i.e., subject heading,
file attachments, message body)
* Add HTML and plain-text templates (rendered by Jinja)
* Update SESClient to render message body in emails and accept multiple
attachments
* Create WorkflowEvents dataclass to capture useful report data
during execution of Workflow methods (replaces 'report_data' attribute)
* Add test module for reporting framework
* Update tests as needed

Side effects of this change:
* As of this writing, the existing report templates are quite simple,
with the goal of getting to an minimum viable product. It would be good
to include in coming documentation what email recipients can expect
in the emailed report.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1157
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review February 6, 2025 14:37
@jonavellecuerdo jonavellecuerdo requested review from a team, ehanson8 and ghukill February 6, 2025 14:37
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great, a number of questions and comments but this was easy to read and understand how it integrates into the app. Fantastic work

dsc/cli.py Outdated
ctx: click.Context,
workflow_name: str,
batch_id: str,
create_and_send_report: bool, # noqa: FBT001
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to put this on main rather than the CLI command itself? Then there would be no need to add it to the context object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dsc/cli.py Outdated
workflow.process_results()
workflow.report_results(email_recipients.split(","))

if ctx.obj["create_and_send_report"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this boolean check is better suited to submit, I have a hard time imagining a workflow where the app wouldn't report out the results of the DSS run.

Also, if the create_and_send_report param is moved to submit, it could look this:

if create_and_send_report:
    workflow.send_report(
            report_class=FinalizeReport, email_recipients=email_recipients.split(",")
        )

And it could probably also be simplified to just report

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess the idea behind the current structure is: giving the user an option to "send reports" (i.e., emails) that summarize the outcome of each DSC CLI command: reconcile, submit, finalize, and related to comment above, #121 (comment), this would mean adding the boolean flag option to all CLI commands.

I'm okay with renaming create_and_send_report to report, but perhaps we can discuss this in our meeting this afternoon!

Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a first pass, some questions and requests for discussion.

Overall, think it looks great! I had the opportunity to discuss and see sketches of this via a huddle, so felt very easy to reconigze how it would work. But I think that would be true even without pre-exposure.

For my own thinking, I break it down into these big blocks:

  • Workflows always capture noteworthy events that happened while they worked, for reconcile, submit, or finalize
  • Reports can optionally be created after a workflow completes, by using data from the complete workflow object (name, events, etc.)
  • Reports can serialize themselves into plain text or HTML
  • Workflows have the ability to "report out", so it's their method send_report that is used

As I type all this... I wonder if Workflow's send_report() should actually be unaware of a Report object? Adding as a final comment in my review, only surfaced from this typing here.

But, overall, looking real good.

dsc/report.py Outdated
Comment on lines 170 to 190
def to_plain_text(self) -> str:
template = self.env.get_template(f"{self.template_name}.txt")
return template.render(
workflow_name=self.workflow_name,
batch_id=self.batch_id,
report_date=self.report_date,
status=self.status,
processed_items=self.events.processed_items,
errors=self.events.errors,
)

def to_rich_text(self) -> str:
template = self.env.get_template(f"{self.template_name}.html")
return template.render(
workflow_name=self.workflow_name,
batch_id=self.batch_id,
report_date=self.report_date,
status=self.status,
processed_items=self.events.processed_items,
errors=self.events.errors,
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the comment above about the base class having abstract methods for these to_plain_text() and to_rich_text(), perhaps if these are defined on each class they could just hardcode the precise template file they want to use?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking another parallel from the workflow classes, could self.env.get_template("hard_coded_template_name}.txt") be a property that is set in the subclasses but called here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that pattern @ehanson8. And maybe could extend one hop further. Maybe the properties are the actual template filename, which allows leaving self.env.get_template(...) the same for all classes, and therefore set as a property on the base Report class?

Like:

class Report:
    @property
    @abstractmethod
    def jinja_template_plain_text_filename(self): ...

    @property
    @abstractmethod
    def jinja_template_html_filename(self): ...

    @property
    def jinja_template_plain_text(self):
        return self.env.get_template(self.jinja_template_plain_text_filename)

    @property
    def jinja_template_html(self):
        return self.env.get_template(self.jinja_template_html_filename)


class FinalReport(Report):
    @property
    def jinja_template_plain_text_filename(self):
        return "final.txt"

    @property
    def jinja_template_html_filename(self):
        return "final.html"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that extension a lot!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See changes in commit 84652e9

Comment on lines 322 to 330
items.append(
{"item_identifier": item_identifier, "message_body": message_body}
)
self.workflow_events.processed_items.append(
{
"item_identifier": item_identifier,
"doi": message_body["ItemHandle"],
}
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two objects look very similar. This also goes back to an interesting discussion I recall having with @ehanson8 about passing data vs reporting.

For the WorkflowEvents, do we want to accumlate objects like this? Or might we want more human-friendly strings?

Argument for objects: this gives the report the ability to change what it retrieves from the object later, and how it displays that.

Argument for strings: the workflow is deciding how to "spin" or describe the event, and it's a static string that it provides.

I don't think there is a "correct" answer, just an opinionated one. But I do think answering that question has bearing on how these two objects look very similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the reasons shared above, I think I prefer objects. It seems like it would be easier to either create a custom report class or update a template vs. changing a string in WorkflowEvents (and in Workflow method calls where WorkflowEvents is populated) whenever a rewording is needed. Similarly, adding a data point to WorkflowEvents if something is missing and then updating the report template feels more flexible...

What do you think, @ehanson8 and @ghukill ? 🤔

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonavellecuerdo - objects make sense to me! I guess my followup regarding their near similarity: why not use the same object for both then? It looks like the reporting object just is a subset of the SQS message body, but if you pass the whole thing, then reporting has more options on what to selectively report on.

* Update CLI help description for 'create-and-send-report'
* Rename 'Report.env' -> 'Report.jinja_env'
* Keep original Workflow.workflow_name (undo capitalization for Report.workflow_name)
* Rename 'Report.to_rich_text' -> 'Report.to_html'
* Create 'reports' subfolder
* Add property, abstract methods for Jinja template filenames to
'Report' classes
* Add property method for loading templates
@coveralls
Copy link

coveralls commented Feb 7, 2025

Pull Request Test Coverage Report for Build 13244853659

Details

  • 107 of 120 (89.17%) changed or added relevant lines in 7 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-2.0%) to 96.552%

Changes Missing Coverage Covered Lines Changed/Added Lines %
dsc/reports/base.py 37 38 97.37%
dsc/reports/finalize.py 29 41 70.73%
Totals Coverage Status
Change from base Build 13163819468: -2.0%
Covered Lines: 560
Relevant Lines: 580

💛 - Coveralls

* Remove boolean CLI option and make reporting default, add stub
methods for 'reconcile' and 'submit' commands
* Create 'reports' subfolder
* Update Report.to_html and Report.to_plain_text to abstract methods
* Remove Report.status property method
* Simplify report templates
* Reorganize logic in Workflow.process_sqs_queue, capturing all processed items
and differentiate between 'processing errors' and 'non-ingested items'
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic work and I would fully approve but there's one variable rename that I don't think is accurate unless I'm missing something. Otherwise, it looks great!

for sqs_message in sqs_client.receive():
try:
item_identifier, message_body = sqs_client.process_result_message(
item_identifier, result_message = sqs_client.process_result_message(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically just the message body that's returned by process_result_message, the result message contains the message attributes as well (where the only thing we care about currently is the item_identifier which we're already extracting)

Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!! Great discussions last week. Feel good about where it landed here to move forward with.

Comment on lines 329 to 343
# capture all processed items, whether ingested or not
item_data = {
"item_identifier": item_identifier,
"result_message": result_message,
"ingested": result_message["ResultType"] == "success",
}
self.workflow_events.processed_items.append(item_data)
items.append(item_data)

if not item_data["ingested"]:
message = (
f"Item '{item_identifier}' did not ingest successfully: {sqs_message}"
)
logger.info(message)
self.workflow_events.errors.append(message)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After our discussions last week, I had to do a double/triple take here, but this feels right.

What I like about this flow:

  • we always put the processed item into the WorkflowEvents.processed_items list, whether ingested or not
  • we save human-friendly message strings to WorkflowEvents.errors

Perhaps as reports grow in the future, we might want to lean even heavier into that and call it something like WorkflowEvents.error_messages or something, but maybe not required.

What we should be careful about -- and maybe this even suggests a docstring sometime -- is not treating Workflow.WorkflowEvents as data to feed into more work with the items. My understanding is that it's 100% about reporting... they just so happen to be structured objects.

@jonavellecuerdo jonavellecuerdo merged commit 64ab68d into main Feb 10, 2025
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1157-build-report-system branch February 10, 2025 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants