TDL/7493 Batch Archiving #8610

Merged

Conversation

qqmyers
Member

@qqmyers qqmyers commented Apr 13, 2022

What this PR does / why we need it: This PR implements an API call to archive all unarchived dataset versions (or only the latest version of each unarchived dataset). It can be used in lieu of configuring a post-publish workflow, or as a catch-up mechanism to archive datasets that were published before an archiving workflow was configured. The API also has a list-only param that lists the dataset versions that would be archived without taking any action.

Which issue(s) this PR closes:

Closes #7493

Special notes for your reviewer: The code includes a ToDo to update the logging - once the changes to support Failure/Pending/Success status (#8696) are merged, this code should be updated to count a Failure status as a failure and a Pending status as a success (the processing is async, so Pending and Success both count as successful launches as far as this code is concerned).

Suggestions on how to test this: Set up your favorite archiver (e.g. the Local one), publish a few datasets/versions, and then run this API (see the sketch below) to check whether all versions, or only the latest ones, cause Bags to be generated and sent to your archive location. (You can also try the list-only option described in the docs to list which versions would be processed, and then confirm that the same list is actually processed.)
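For reference, here is a minimal sketch of driving the new call from Java; the endpoint path, the X-Dataverse-key header, and the exact query parameter names are assumptions based on the description above and the usual admin API conventions, not something this PR text pins down by itself:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BatchArchiveDemo {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint and parameters: listonly=true only lists what would be
        // archived, limit caps how many versions one call will process.
        String url = "http://localhost:8080/api/admin/archiveAllUnarchivedDatasetVersions"
                + "?listonly=true&limit=10";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("X-Dataverse-key", System.getenv().getOrDefault("API_TOKEN", ""))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```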

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: planning one release note across PRs

Additional documentation:

@qqmyers qqmyers added the GDCC: TDL supported by Texas Digital Library label Apr 13, 2022
@coveralls

coveralls commented Apr 13, 2022

Coverage Status

Coverage decreased (-0.02%) to 19.712% when pulling e4a228d on TexasDigitalLibrary:TDL/7493-batch_archiving-only into af22d3f on IQSS:develop.

@qqmyers qqmyers added HDC Harvard Data Commons HDC: 3a Harvard Data Commons Obj. 3A labels Jul 15, 2022
@qqmyers
Member Author

qqmyers commented Jul 15, 2022

FWIW: tagging this as 3a since it includes changes to the archive API (POST vs. GET, name change) that we should have before trying to sync with 3A services again.

@qqmyers
Member Author

qqmyers commented Jul 20, 2022

@scolapasta - 3A could/should have the changes to the API naming that are in this PR. I can pull that out elsewhere if we don't want to handle the new batch call now.

@scolapasta scolapasta self-assigned this Jul 20, 2022
*/
public List<DatasetVersion> getUnarchivedDatasetVersions(){

String queryString = "SELECT OBJECT(o) FROM DatasetVersion AS o WHERE o.releaseTime IS NOT NULL and o.archivalCopyLocation IS NULL";
Contributor

Could we make this a named query?
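For illustration, a sketch of what the named-query form could look like, reusing the JPQL string above and the DatasetVersion.findUnarchivedReleasedVersion name that shows up later in this diff (the exact placement on the entity is an assumption):

```java
import javax.persistence.Entity;
import javax.persistence.NamedQuery;

// Sketch only: the annotation would sit on the existing DatasetVersion entity,
// replacing the inline JPQL string in the service bean.
@NamedQuery(name = "DatasetVersion.findUnarchivedReleasedVersion",
        query = "SELECT OBJECT(o) FROM DatasetVersion AS o"
              + " WHERE o.releaseTime IS NOT NULL AND o.archivalCopyLocation IS NULL")
@Entity
public class DatasetVersion {
    // existing fields and mappings (id, releaseTime, archivalCopyLocation, ...) unchanged
}
```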

// Note - the user is being set in the session so it becomes part of the
// DataverseRequest and is sent to the back-end command where it is used to get
// the API Token which is then used to retrieve files (e.g. via S3 direct
// downloads) to create the Bag
session.setUser(au); // TODO: Stop using session. Use createDataverseRequest instead.
Contributor

should we make this change while we're in this code?
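If the change were made as part of this PR, a rough sketch of the shape the TODO points at, assuming the createDataverseRequest(User) helper on the API base class (how the request is then threaded into the archive command is glossed over here):

```java
// Sketch: build the DataverseRequest from the authenticated user directly
// instead of stashing the user in the session.
AuthenticatedUser au = findAuthenticatedUserOrDie();
DataverseRequest request = createDataverseRequest(au);
// ... pass 'request' into the submit-to-archive command rather than relying on session state
```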

// DataverseRequest and is sent to the back-end command where it is used to get
// the API Token which is then used to retrieve files (e.g. via S3 direct
// downloads) to create the Bag
session.setUser(au);
Contributor

same as above, if we do decide to make the change

String className = settingsService.getValueForKey(SettingsServiceBean.Key.ArchiverClassName);
AbstractSubmitToArchiveCommand cmd = ArchiverUtil.createSubmitToArchiveCommand(className, dvRequestService.getDataverseRequest(), dsl.get(0));
final DataverseRequest request = dvRequestService.getDataverseRequest();
if (cmd != null) {
Contributor

could you explain what this line is doing? (i.e. why is a command being created before this and checked for null)

Member Author

The create method is trying to read the property and instantiate the specified class (a subclass of AbstractSubmitToArchiveCommand) via reflection. If the class doesn't exist, it would fail/return null.

Contributor

Ah gotcha, it could actually return null, then. Maybe just a quick comment to make that clear, since it's different from when we normally create a command?
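For anyone reading along, a rough illustration of the reflection pattern described above; this is not the actual ArchiverUtil code, just the general shape (the constructor signature and error handling are assumptions):

```java
import java.lang.reflect.Constructor;

// Illustrative only: look up the configured AbstractSubmitToArchiveCommand subclass by
// name and instantiate it, returning null if the class is missing or cannot be built.
public static AbstractSubmitToArchiveCommand createSubmitToArchiveCommand(
        String className, DataverseRequest request, DatasetVersion version) {
    if (className == null) {
        return null;
    }
    try {
        Class<?> clazz = Class.forName(className);
        Constructor<?> ctor = clazz.getConstructor(DataverseRequest.class, DatasetVersion.class);
        return (AbstractSubmitToArchiveCommand) ctor.newInstance(request, version);
    } catch (ReflectiveOperationException | RuntimeException e) {
        return null;
    }
}
```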

logger.info("Archiving complete: " + successes + " Successes, " + failures + " Failures. See prior log messages for details.");
}
}).start();
return ok("Archiving all unarchived published dataset versions using " + cmd.getClass().getCanonicalName() + ". Processing can take significant time for large datasets/ large numbers of dataset versions. View log and/or check archive for results.");
Contributor

oh, is it to have the command here?

Member Author

Not really - I could return the className string as done in the null case in the else clause.


The archiveAllUnarchivedDatasetVersions call takes 3 optional configuration parameters.
* listonly=true will cause the API to list dataset versions that would be archived but will not take any action.
* limit=<n> will limit the number of dataset versions archived in one api call to <= <n>.
Contributor

I get how this works, but what's the reason to limit this way? (counting both successes and failures)

Member Author

listonly=true gives you a list, so with limit working this way you can make sure that only the things you listed will get processed when you drop listonly=true.
Overall, the concern is about load, particularly if/when something is misconfigured and everything will fail after all the work to create a bag.

Contributor

Are these guaranteed to work with the list in the same order each time (i.e. if something added, it would be added at the end, so limit is guaranteed to get the things from the last listAll)?

Member Author

Yes - it's the new named query that is getting the list, so unless something affects the return order from that (which I think will be id order by default), it wouldn't change.
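If the ordering ever needs to be made explicit rather than relying on the default, the named query could state it directly (a suggestion for illustration, not something this PR does):

```java
// Sketch: pin the result order so listonly followed by limit always sees the same prefix.
@NamedQuery(name = "DatasetVersion.findUnarchivedReleasedVersion",
        query = "SELECT OBJECT(o) FROM DatasetVersion AS o"
              + " WHERE o.releaseTime IS NOT NULL AND o.archivalCopyLocation IS NULL"
              + " ORDER BY o.id")
```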


try {
@SuppressWarnings("unchecked")
List<DatasetVersion> dsl = em.createNamedQuery("DatasetVersion.findUnarchivedReleasedVersion").getResultList();
Contributor

I think a named query should still be able to take a type, so you don't need the @SuppressWarnings.
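Concretely, the typed JPA overload would look like this and drops the unchecked cast:

```java
// Typed named query: returns List<DatasetVersion> directly, so no @SuppressWarnings needed.
List<DatasetVersion> dsl = em.createNamedQuery(
        "DatasetVersion.findUnarchivedReleasedVersion", DatasetVersion.class)
        .getResultList();
```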

// the API Token which is then used to retrieve files (e.g. via S3 direct
// downloads) to create the Bag
session.setUser(au); // TODO: Stop using session. Use createDataverseRequest instead.
// Note - the user is being set in the session so it becomes part of the
Contributor

just noticed I think this comment can go, no?

try {
AuthenticatedUser au = findAuthenticatedUserOrDie();

// Note - the user is being set in the session so it becomes part of the
Contributor

and here.

Member Author

updated

@scolapasta scolapasta removed their assignment Jul 26, 2022
@kcondon kcondon self-assigned this Jul 29, 2022
@scolapasta scolapasta added this to the 5.12 milestone Aug 5, 2022
@kcondon kcondon assigned qqmyers and unassigned kcondon Aug 5, 2022
@kcondon
Contributor

kcondon commented Aug 5, 2022

Not seeing JSON output; the logger needs to be FINE rather than INFO; seeing a null pointer exception at the end of the log output.

A system exception occurred during an invocation on EJB Admin, method: public javax.ws.rs.core.Response edu.harvard.iq.dataverse.api.Admin.archiveAllUnarchivedDatasetVersions(boolean,java.lang.Integer,boolean)]]

Caused by: java.lang.NullPointerException
at edu.harvard.iq.dataverse.DatasetVersion.getFriendlyVersionNumber(DatasetVersion.java:539)

@qqmyers qqmyers removed their assignment Aug 8, 2022
@kcondon kcondon self-assigned this Aug 8, 2022
@kcondon kcondon merged commit 04bfd3d into IQSS:develop Aug 8, 2022