
Conversation

@jonavellecuerdo (Contributor)

Purpose and background context

Update app description in README and add docs explaining the DSC workflow and the DSC CLI.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

@jonavellecuerdo requested a review from a team as a code owner on October 20, 2025 16:29

_This documentation describes the DSC workflow and how to run the application._

**DISCLAIMER**: While the CLI application is runnable on its own, the DSO Step Function offers a simplified user interface for running the full ETL pipeline. For more details on the DSO Step Function and how to use it, see https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/4690542593/DSpace+Submission+Orchestrator+DSO.
Contributor Author

Will be focusing on documenting the step function in the DSO Confluence page next!

@coveralls commented Oct 20, 2025

Pull Request Test Coverage Report for Build 18690086927

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file is covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 95.234%

Totals (Coverage Status):
  • Change from base Build 18565553930: 0.0%
  • Covered Lines: 1059
  • Relevant Lines: 1112

💛 - Coveralls

sync Sync data between two directories using the aws s3 sync...
```

### `pipenv run dsc -w <workflow-name> -b <batch-id> create`
@jonavellecuerdo (Contributor Author) commented on Oct 20, 2025

You will notice that when using DSC, we need to provide the --workflow-name / -w and --batch-id / -b CLI options even if we just want to run <command-name> --help. This is because we currently define the options on the main command group and set them as required=True.

We could either:

A. Decorate each command with the -w and -b CLI options and remove setting in main command group.
B. Create a shared_cli_options decorator that contains all CLI options used by the CLI commands (i.e., see example in CDPS-CURT; a sketch of this approach follows below)
C. Do nothing

@ghukill @ehanson8 Let me know what you think! This would, of course, be outside the scope of this PR.
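
For illustration, here is a minimal sketch of option B above, assuming click (which dsc uses); the group, command, and option names are simplified stand-ins, not DSC's actual CLI code:

```python
import click

def shared_cli_options(func):
    """Attach the CLI options shared by all commands (hypothetical names)."""
    shared = [
        click.option("-w", "--workflow-name", required=True, help="DSC workflow name."),
        click.option("-b", "--batch-id", required=True, help="Batch identifier."),
    ]
    # Apply in reverse so options appear in declaration order in --help.
    for option in reversed(shared):
        func = option(func)
    return func

@click.group()
def main():
    """dsc command group: no required options here, so --help works alone."""

@main.command()
@shared_cli_options
def create(workflow_name: str, batch_id: str) -> None:
    """Create a batch of item submissions."""
    click.echo(f"Creating batch {batch_id} for workflow {workflow_name}")

if __name__ == "__main__":
    main()
```

With the required options moved off the main group, something like `pipenv run dsc create --help` would no longer demand -w / -b, since click processes --help eagerly before validating required parameters.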

I'm easy. Whenever we get to that point, whichever approach is feeling right works for me. I feel like every time I really dig into click, I learn a new way to do something.

Contributor

Agreed, just create a ticket with that comment and whoever picks it up can make the call

@ghukill left a comment

Overall, I think it's a great start and foundation. As you mentioned, it's a living document and will likely get some updates.

Left a few comments, with the biggest couple related to how to inform readers that DSC does not actually perform ingests (or even trigger DSS directly).

It's important to note that DSC is not responsible for ingesting items into DSpace; this task is handled by _DSS_. The DSC CLI provides commands for all other steps in the DSC workflow.

### Create a batch
DSC processes deposits in "batches": collections of item submissions grouped by a unique identifier. DSC requires that the item submission assets (metadata and bitstream files) are uploaded to a "folder" in S3, named after the batch ID. While some requestors may upload the submission assets to S3 themselves, in other cases, these files need to be retrieved (via API requests) and uploaded during the batch creation step.
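
As a concrete illustration of the "folder" convention described above, here is a minimal sketch using boto3; the bucket name and batch ID are placeholders, and this is not DSC's actual verification code:

```python
import boto3

def list_batch_assets(bucket: str, batch_id: str) -> list[str]:
    """Return all object keys under the batch's S3 prefix ("folder")."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{batch_id}/"):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Placeholder names for illustration; DSC reads its bucket from configuration.
for key in list_batch_assets("dso-example-bucket", "example-batch-id"):
    print(key)
```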

While some requestors may upload the submission assets to S3 themselves, in other cases, these files need to be retrieved (via API requests) and uploaded during the batch creation step.

I might propose rewording this a bit. I would try and communicate that some workflows are unique in that they utilize APIs to other systems to retrieve metadata and/or bitstreams and upload them to the batch folder in S3. Try and highlight that the end result for all workflows is a batch folder in S3, but that some workflows perform some of this work programmatically as part of the workflow.

Feel free to pushback, but I tripped a bit on the second part:

... in other cases, these files need to be retrieved (via API requests) and uploaded during the batch creation step

where it kind of sounds like a human user might still be involved?

Contributor Author

See the updated description. Thanks for the discussion!

Comment on lines 52 to 58
### Run DSS
DSS consumes the submission messages from the input queue in SQS. DSS uses a client to interact with DSpace. For each item submission, DSS reads the metadata JSON file and bitstreams from S3, using the information provided in the message, and creates an item with bitstreams in DSpace.

At the end of this step:
* Result messages are written to the output queue for DSC (`dss-output-dsc`).

Note: The message is structured in accordance with the [Result Message Specification](https://github.com/MITLibraries/dspace-submission-service/blob/main/docs/specifications/result-message-specification.md).
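
To make the handoff concrete, here is a minimal sketch of how a consumer (such as DSC's finalize step) might poll that output queue with boto3; the queue URL is a placeholder, and DSC's actual processing logic is more involved:

```python
import boto3

# Placeholder URL; the real dss-output-dsc queue URL comes from configuration.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/dss-output-dsc"

sqs = boto3.client("sqs")
response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=10,  # long polling to reduce empty responses
)
for message in response.get("Messages", []):
    print(message["Body"])  # JSON per the Result Message Specification
    # Delete after successful processing so the message is not redelivered.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```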

I would propose emphasizing somewhere here that DSC does not run DSS, and thus does not perform this step.

Please see another comment in the ## The DSC Workflow section asking a related question about this section.

Contributor Author

Updated to:

Items are ingested into DSpace by DSS

📌 Reminder: DSS is not executed by DSC and requires separate invocation.

What do you think?


Note: The message is structured in accordance with the [Result Message Specification](https://github.com/MITLibraries/dspace-submission-service/blob/main/docs/specifications/result-message-specification.md).

### Inspect ingest results

Very nit picky, very subjective, totally feel free to ignore...

I might propose "Analyze" vs "Inspect". For reasons I'm having a hard time articulating, "Inspect" feels like it will then do something with what it learns, while "Analyze" feels like it has a more reporting vibe, which feels more accurate.

Contributor Author

On a second read, "inspect" makes me think we're looking at the item ingested in DSpace, and with that, I agree "analyze" somehow feels different. 🤔

Comment on lines 9 to 13
The DSC workflow consists of the following key steps:

1. Create a batch
2. Queue a batch for ingest
3. Ingest items into DSpace

After reading the full doc, returning here.

From a DSC lens, it might be helpful to portray number 3, "Ingest items into DSpace" more as something that happens versus something DSC does?

For example, if it was "Items are ingested into DSpace by DSS", I think it would inform the language in the section below dedicated to this, which I commented on as well.

I can appreciate it's very difficult, but I think if someone were brand new to all this, they'd need a lot of help understanding that DSC does not actually perform the ingest. Once the StepFunction is more fully formed, and perhaps has its own documentation somewhere (e.g. Confluence), it might be an opportunity in this document to link to it.

Overall, very few changes are probably needed here, just a bunch of little nudges that DSC doesn't do this work. It doesn't even trigger this work; that's actually DSO (the Step Function) that would do that.

@jonavellecuerdo (Contributor Author) commented on Oct 21, 2025

Updated to:

The DSC workflow consists of the following key steps:

  1. Create a batch
  2. Queue a batch for ingest
  3. Items are ingested into DSpace by DSS*
  4. Analyze ingest results

*Important: DSC is not responsible for ingesting items into DSpace, nor does it execute this process. This task is handled by DSS, which is invoked via the DSO Step Function.

What do you think?

Comment on lines +16 to +32
### Create a batch
DSC processes deposits in "batches": collections of item submissions grouped by a unique identifier. Generally, assets for batches are provided in one of two ways:

1. Requestors upload raw metadata and bitstreams (see [How To: Batch deposits with DSO for Content Owners](https://mitlibraries.atlassian.net/wiki/spaces/INF/pages/4411326470/How-To+Batch+deposits+with+DSpace+Submission+Orchestrator+DSO+for+Content+Owners))
2. DSC workflows retrieve raw metadata and/or bitstreams programmatically (via API requests)

Once all assets are stored in a "folder" in S3, DSC verifies that metadata and bitstreams have been provided for each item submission. It is only after _all_ item submissions have been verified that DSC will establish the batch by recording each item in DynamoDB.

At the end of this step:
* If all item submission assets are complete:
- A batch folder with complete item submission assets exists in the DSO S3 bucket
- Each item submission in the batch is recorded in DynamoDB (with `status="batch_created"`)
- **[OPTIONAL]** An email is sent reporting the number of created item submissions. The email includes a CSV file with the batch records from DynamoDB.
* If any item submission assets were invalid (missing metadata and/or bitstreams):
- A batch folder with incomplete item submission assets exists in the DSO S3 bucket
- **[OPTIONAL]** An email is sent reporting that zero item submissions were created. The email
includes a CSV file describing the failing item submissions with the corresponding error message.
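
To make the DynamoDB bookkeeping above concrete, here is a minimal sketch using boto3; the table name, key schema, and attribute names are assumptions for illustration, not DSC's actual schema:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("dsc-item-submissions")  # hypothetical table name

# Record one item submission from the batch (assumed key schema).
table.put_item(
    Item={
        "batch_id": "example-batch-id",  # partition key (assumed)
        "item_identifier": "item-001",   # sort key (assumed)
        "status": "batch_created",
    }
)
```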

:chef's kiss:

Looks great! I feel like it communicates quite a bit of complexity, pretty succinctly.

@ehanson8 (Contributor) left a comment

Excellent work, very minor suggestions!

Running DSC consists of the following key steps:

1. Create a batch
2. Queue a batch for ingest
Contributor

Optional: I know the Step Function doc uses the word "queue" for the submit step as well but you might consider whether to just use "submit" to avoid potential confusion

Contributor Author

Hmm, might leave as-is for now. I like that it alludes to the SQS queues. 🤔

Commands:
create Create a batch of item submissions.
finalize Process the result messages from the DSS output queue...
reconcile Reconcile bitstreams with item identifiers from the metadata.
Contributor

Should this be removed or were we planning to do that later?

Contributor Author

Good catch! When we remove the reconcile code from DSC, we can update then. :)


Commands:
create Create a batch of item submissions.
finalize Process the result messages from the DSS output queue...
Contributor

Excess periods here?

Contributor Author

Ahh, this was from text-wrapping -- i.e., it shows up automatically when Click prints the help docs to the console~

@jonavellecuerdo merged commit 0888f6d into main on Oct 22, 2025
3 checks passed