GDCC/8605-add-archival-status-support #8696

Member

@qqmyers qqmyers commented May 13, 2022

What this PR does / why we need it: To support more sophisticated archivers (i.e., those that can provide status feedback and may have multistep internal processes), this PR adds support for managing archival status. Specifically, it changes archivalCopyLocation from a null/String value (originally intended as a URL identifying or providing a landing page for the archival copy in the archiving system) to a JSON object that contains a 'status' of 'success', 'pending', or 'failure' and a 'message' that is a string. In the success case, the message is still intended to be an identifier or landing page URL, whereas for failure and pending the message can be any informative string.
As noted in the issue, this work is supported as part of the Harvard Data Commons project (3A) for use specifically with the DRS Archiver. However, the PR also updates the other existing archivers to use the same format (although these currently report only success and failure, with no pending state).
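To make the new column format concrete, here is a minimal sketch of the status object described above. The class and method names are hypothetical and are not taken from the Dataverse code base; only the 'status' and 'message' keys and the three status values come from the PR description. The escaping is deliberately minimal, and a real implementation would presumably use a JSON library.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical value class for illustration only; not Dataverse's actual class.
final class ArchivalStatusSketch {

    // The three states named in the PR description.
    enum Status { SUCCESS, PENDING, FAILURE }

    private final Status status;
    private final String message;

    ArchivalStatusSketch(Status status, String message) {
        this.status = Objects.requireNonNull(status);
        this.message = Objects.requireNonNull(message);
    }

    // Escapes only backslashes and double quotes (assumption: enough for a sketch).
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Serializes to the {"status": ..., "message": ...} shape described in the PR.
    String toJson() {
        return "{\"status\":\"" + status.name().toLowerCase(Locale.ROOT)
                + "\",\"message\":\"" + escape(message) + "\"}";
    }

    public static void main(String[] args) {
        // On success the message is an identifier or landing page URL (example URL is made up).
        System.out.println(new ArchivalStatusSketch(
                Status.SUCCESS, "https://archive.example.org/bag/123").toJson());
        // For pending or failure the message is free text.
        System.out.println(new ArchivalStatusSketch(
                Status.PENDING, "Awaiting ingest").toJson());
    }
}
```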

Which issue(s) this PR closes:

Special notes for your reviewer:

  • Could rename the db column as it is no longer a location.
  • The API calls follow the original naming convention of the admin API submitDatasetVersionToArchive format, which doesn't fit as well with the /api/datasets convention of having the dataset id next. These could be changed, which would require changes in the DataCommons service that calls them, and presumably we'd want to align the existing admin call and the batch call in TDL/7493 Batch Archiving #8610. Let me know a decision.
  • Note that the Flyway script also handles the 'Attempted' state introduced in GDCC/8604 Improve archiver error handling #8612. Nominally this state should exist only in development databases and at TDL, where it was added to avoid rerunning the archiving for failed datasets when doing batch uploads. That will be superseded/replaced by this PR.
  • FWIW: This API was added to the /datasets endpoint because the intent is for remote archiving systems (like DRS) to report their status updates, and putting it in admin would restrict it to localhost or require changing to the unblock-key policy. The API is limited to superuser use.

Suggestions on how to test this: The new API supports get/set/delete of the status values. The simplest test would be to configure an archiver, such as the Local file archiver, and use the API to retrieve the status and verify the success message. (I think misconfiguring it, e.g. pointing to a directory where the archiver can't write, should allow viewing a failure status as well.)
Also note that another PR is coming that will show the archival status in the versions table, which gives more opportunity to test the API.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No - this is db/api only

Is there a release notes update needed for this change?: part of #8611

Additional documentation:

@coveralls
coveralls commented May 13, 2022

Coverage Status

Coverage decreased (-0.03%) to 19.736% when pulling 7410c5b on GlobalDataverseCommunityConsortium:GDCC/8605-add-archival-status into 567e506 on IQSS:develop.

@qqmyers qqmyers marked this pull request as ready for review May 13, 2022 20:53
@qqmyers qqmyers added the HDC: 3a Harvard Data Commons Obj. 3A label May 17, 2022
@qqmyers qqmyers added the HDC Harvard Data Commons label May 24, 2022
@sekmiller sekmiller self-assigned this Jun 6, 2022
@sekmiller sekmiller removed their assignment Jun 24, 2022
@pdurbin pdurbin self-assigned this Jul 14, 2022
Member

@pdurbin pdurbin left a comment

I didn't run the code yet but here's some initial feedback.

src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java (outdated, resolved)
src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java (outdated, resolved)
}
Dataset ds = findDatasetOrDie(dsid);

DatasetVersion dv = datasetversionService.findByFriendlyVersionNumber(ds.getId(), versionNumber);
Member

Any reason not to use getDatasetVersionOrDie here (and in the other two calls to findByFriendlyVersionNumber in this PR)?

Member Author

Not sure I saw it but looking now, getDatasetVersionOrDie doesn't support the friendlyVersionNumber syntax which is a ~requirement here (that's the convention used in the Bag naming and metadata that the archiver gets). I can go ahead and add parsing for that which would have the presumably useful side effect of letting other datasetversion api calls support the friendly version number as well.

Member

It should. I'm seeing handleSpecific(long major, long minor). It's used by https://guides.dataverse.org/en/5.11/api/native-api.html#get-version-of-a-dataset which has a "friendly" example of "1.0".

Member Author

Yep - you're right. I missed the string parsing in handleVersion(). I'll update the PR to use it.
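For readers unfamiliar with the friendly version syntax discussed in this thread, the following is a rough sketch of splitting a string such as "1.0" into major and minor numbers. FriendlyVersionSketch and its parse method are hypothetical names used only for illustration; Dataverse's actual parsing is done in handleVersion() and may behave differently (for example, it may accept other version identifiers).

```java
// Hypothetical helper for illustration; not the Dataverse implementation.
final class FriendlyVersionSketch {
    final long major;
    final long minor;

    FriendlyVersionSketch(long major, long minor) {
        this.major = major;
        this.minor = minor;
    }

    // Parses "major.minor" (e.g. "1.0"); rejects anything else.
    static FriendlyVersionSketch parse(String text) {
        String[] parts = text.split("\\.");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected major.minor but got: " + text);
        }
        try {
            long major = Long.parseLong(parts[0]);
            long minor = Long.parseLong(parts[1]);
            if (major < 0 || minor < 0) {
                throw new IllegalArgumentException("Negative version number: " + text);
            }
            return new FriendlyVersionSketch(major, minor);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Not a friendly version number: " + text, e);
        }
    }

    public static void main(String[] args) {
        FriendlyVersionSketch v = parse("1.0");
        System.out.println("major=" + v.major + " minor=" + v.minor);
    }
}
```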

Member Author

Hmm - calls to this are counted with MakeDataCounts. I guess since these are API calls they should count? (although they are clearly system-level interactions and not end-user interaction with the data). In any case, I went ahead for now.

Member

I dunno. I'd leave this out of Make Data Count. Like you said, these are systems setting and retrieving archival status messages. The spirit of Make Data Count is views/investigations and downloads/requests. People and machines looking at data.

src/main/java/edu/harvard/iq/dataverse/api/Datasets.java (outdated, resolved)
src/main/java/edu/harvard/iq/dataverse/api/Datasets.java (outdated, resolved)

@GET
@Produces(MediaType.APPLICATION_JSON)
@Path("/submitDatasetVersionToArchive/{id}/{version}/status")
Member

submitDatasetVersionToArchive is a weird name. submitDataVersionToArchive (Data instead of Dataset) is under /api/admin and documented under installation/config.html

Member Author

Yes. So far it ~mirrors the /api/admin/submitDatasetVersionToArchive call (name changed to say 'Dataset' in #8610 which hasn't merged yet), which seemed reasonable when it was a single call. With the status calls, I initially had them in /api/admin as well, but eventually decided they should move to /api/datasets (see the comment about superuser being required on those). With that, they could be renamed - e.g. to /api/datasets/<id>/<version>/archivalStatus .

Member

I like the new name ending with /archivalStatus. Thanks.
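As a rough illustration of how a remote archiving system might address the renamed endpoint, the sketch below builds (but does not send) get/set/delete requests with the JDK's java.net.http API. Several things here are assumptions rather than facts from this thread: the final path is taken from the /api/datasets/<id>/<version>/archivalStatus proposal above, PUT is assumed as the verb for "set", and the class name, base URL, and token are placeholders. The X-Dataverse-key header is the usual way to pass a Dataverse API token, which would need to belong to a superuser here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical client-side helper; requests are only built, never sent.
final class ArchivalStatusRequests {

    // Path follows the naming proposed in the review thread (assumption).
    static URI statusUri(String baseUrl, long datasetId, String version) {
        return URI.create(baseUrl + "/api/datasets/" + datasetId + "/" + version + "/archivalStatus");
    }

    static HttpRequest get(String baseUrl, long datasetId, String version, String apiToken) {
        return HttpRequest.newBuilder(statusUri(baseUrl, datasetId, version))
                .header("X-Dataverse-key", apiToken)
                .GET()
                .build();
    }

    // Assumes PUT with a JSON body such as {"status":"pending","message":"..."}.
    static HttpRequest set(String baseUrl, long datasetId, String version,
                           String apiToken, String statusJson) {
        return HttpRequest.newBuilder(statusUri(baseUrl, datasetId, version))
                .header("X-Dataverse-key", apiToken)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(statusJson))
                .build();
    }

    static HttpRequest delete(String baseUrl, long datasetId, String version, String apiToken) {
        return HttpRequest.newBuilder(statusUri(baseUrl, datasetId, version))
                .header("X-Dataverse-key", apiToken)
                .DELETE()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = get("http://localhost:8080", 42L, "1.0", "placeholder-token");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending any of these would go through java.net.http.HttpClient; that step is left out so the sketch stays independent of a running installation.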

src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java (outdated, resolved)
@@ -0,0 +1,2 @@
UPDATE datasetversion SET archivalCopyLocation = CONCAT('{"status":"success", "message":"', archivalCopyLocation,'"}') where archivalCopyLocation is not null and not archivalCopyLocation='Attempted';
Member

If this script is only needed by TDL as suggested in the PR description, perhaps we don't need it.

Member Author

The core statement,

UPDATE datasetversion SET archivalCopyLocation = CONCAT('{"status":"success", "message":"', archivalCopyLocation,'"}') where archivalCopyLocation is not null

is needed for standard instances (those that have used archiving and therefore have non-null entries). The trailing "and not archivalCopyLocation='Attempted';" clause and the second line handle the case that TDL deployed, which came from the initial PR #8610; that PR has since been overtaken by this one.

Member

Ok, I guess my understanding is that both lines are needed or at least won't hurt anything.
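To show what the first UPDATE statement does to an individual column value, here is a small sketch that mirrors it in Java. The class and method names are hypothetical. It covers only the first line of the script as quoted in this thread; the second line, which deals with 'Attempted' values, is not shown in the conversation and is therefore not modeled. Like the SQL CONCAT, the sketch does no escaping of the legacy value.

```java
// Hypothetical illustration of the first UPDATE in the Flyway script; not Dataverse code.
final class LegacyLocationMigrationSketch {

    // Returns the column value after the first UPDATE has run.
    static String migrate(String archivalCopyLocation) {
        // Rows that are null or 'Attempted' are excluded by the WHERE clause.
        if (archivalCopyLocation == null || archivalCopyLocation.equals("Attempted")) {
            return archivalCopyLocation;
        }
        // Same concatenation as the SQL: the legacy value becomes the success message.
        return "{\"status\":\"success\", \"message\":\"" + archivalCopyLocation + "\"}";
    }

    public static void main(String[] args) {
        // The example URL is made up.
        System.out.println(migrate("https://archive.example.org/bag/123"));
        System.out.println(migrate("Attempted"));
        System.out.println(migrate(null));
    }
}
```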

@qqmyers qqmyers removed their assignment Jul 18, 2022
Member

@pdurbin pdurbin left a comment

Looks good. I played around with the tests in DatasetsIT. I didn't test the SQL upgrade script.

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Review 🔎 to QA ✅ Jul 19, 2022
@pdurbin pdurbin removed their assignment Jul 19, 2022
@kcondon kcondon merged commit fed27f9 into IQSS:develop Jul 21, 2022
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Jul 21, 2022
@kcondon kcondon self-assigned this Jul 25, 2022
@pdurbin pdurbin added this to the 5.12 milestone Jul 25, 2022
Labels
HDC Harvard Data Commons HDC: 3a Harvard Data Commons Obj. 3A
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

HDC 3A: support handling archival status updates
6 participants