Add support for archiving/restoring relationships via AIPs / packager#3328
Add support for archiving/restoring relationships via AIPs / packager#3328ConfusionOrb221 wants to merge 28 commits into
Conversation
|
This pull request introduces 6 alerts when merging c87f85b into 94e18cb - view on LGTM.com new alerts:
|
cwilper
left a comment
There was a problem hiding this comment.
Two general areas that should be addressed:
- Non-functional (code formatting) changes in a few places...avoid these so review can focus on the functionality (but stay within checkstyle guidelines)
- Need javadocs on any public methods (saw some missing on a new class, there may be others)
Test coverage could also be increased to cover "scope":
- When provided, it should be respected
- When not provided, it should obey the default behavior (which should be different from the test case for the above)
|
This pull request fixes 1 alert when merging d1e7dfd into 94e18cb - view on LGTM.com fixed alerts:
|
…138096_Retain-Rels-7.6
… w2p-138096_backport_Retain-Rels-9.0
542570c to
0e68890
Compare
|
@mdiggory and @ConfusionOrb221 : Thanks for your continued work on this. While this PR has missed the "new feature" merger deadline for DSpace 10.0 (which was last Friday), I'm going to talk with developers/Committers in our Dev Mtg tomorrow about whether to recategorize this PR as a bug fix. The reason for potentially doing so is that the AIP functionality has been flawed (since 7.x) because it doesn't work properly for Entities and their relationships. Therefore, it seems reasonable (to me) that, assuming this PR fixes the flaws, we might want to categorize this as a bug fix and potentially backport it to 9.x/8.x/7.x. However, that decision will require discussion in a DevMtg. If this PR is recategorized as a "bug fix", then I'll update the labels and we'll work to find additional reviewers/testers to try to get it into 10.0 before the bug-fix deadline (May 22). Otherwise, unfortunately, this will miss the 10.0 release because it was not reviewed/merged based on our deadlines (as it requires two +1 votes to merge and it's only currently at +1). I'll post updates after the discussion tomorrow. |
|
Hi @ConfusionOrb221, |
|
Per discussion in today's Developers Meeting, this has been recategorized as a "bug fix" because AIP Backup & Restore is currently broken for Configurable Entities. That means this PR can still be included in 10.0 (and potentially backported) provided that it gets additional testers/reviewers (as it requires two +1 votes) prior to the 10.0 bug fix merger deadline of May 22. |
|
@tdonohue, this is good news. Thank you. How to promote this to other reviewers? |
|
@mdiggory : I've been trying to promote this in weekly developer meetings, but haven't found other volunteers yet. So, I'd recommend that you also try to promote it on Slack or Committers list or similar. It might even be possible to find more testers by providing some example ways to test this more easily (e.g. provide some example step by step instructions in the PR description, as the "Testing" section in the PR description is currently very vague.) I'm also planning on testing it myself, but I've not yet been able to find time for it. This is of interest to me though as I'd like to be able to potentially restore all the demo entities data used on https://demo.dspace.org and https://sandbox.dspace.org via AIP Backup & Restore. |
|
@mdiggory and @ConfusionOrb221 : I'm trying to run some tests today with my local test instance. It has a lot of test Entities, mostly based on our Demo Entity Dataset. This is the same demo data that we use on https://sandbox.dspace.org and https://demo.dspace.org While I'm able to successfully export to AIPs (and see the new relationships represented in the Here's what I've tried to do:
The import immediately fails, but it appears possibly unrelated to this PR? It fails with a I currently cannot seem to find a way around this, but it's possible it's an issue with my local data or configuration (If I manage to figure out a way to fix my data, I'll let you know) Have either of you successfully used this PR to export everything out of a DSpace site (especially one with our Demo Entities Dataset) and restore it into an empty site? In other words, has this PR been tested at the site-wide export/import level, or just for individual objects? |
|
@tdonohue |
|
@ConfusionOrb221 : I understand. My challenge is that I want to give this a thorough test by doing a full restoration on an empty database. Currently, I cannot do that....but, I'm not convinced yet that it's the fault of this PR. There almost seems to be a lot of "fragile" code in the AIP Backup and Restore that seems to make this complete restoration difficult. This full restoration process used to work. It's just no longer working for me and I'm wrestling with it so that I can thoroughly test this PR. |
There was a problem hiding this comment.
@ConfusionOrb221 and @mdiggory : I've finally been able to get basic site-wide export and import of AIPs working again (though I needed to fix several small bugs in AIP backup & restore as detailed in #12343).
Because of that, I've been able to get back to testing this PR (with my #12343 in place as well).
So far, I've not had luck in restoring relationships between AIPs even though the AIPs generated do all have the <structMap ID="rels"> element in the METS which stores the relationships.
I have this same issue if I'm restoring an entire site, or just a single Community (with a number of related Entities).
Here's an example AIP export of just a single Community (which corresponds to this community on our Sandbox)
I'm restoring this content by doing this:
tar -xvzf demo-dcat-journals.tar.gz
# That puts everything in a `tmp` directory
cd tmp
# Now, import it all (starting with the community) as a user named "test@test.edu"
[dspace]/bin/dspace packager -r -a -f -t AIP -e test@test.edu -o skipIfParentMissing=true COMMUNITY\@123456789-1119.zip
The results that I see is that everything imports properly with the proper entity type. However, none of the relationships are restored. Any ideas? Could you see if this small set of data works for you?
|
@tdonohue I'll have to check exactly myself but part of the default -z operation that would run here (as none was provided) would cause it to only create rels for items that currently exist in the repository. This is because we didnt want to have a runoff where you ingest an item and it starts creating 1000's of related items (items related to the item you are ingesting then items related to those ingested relationships etc) or you ingest a community and you end up affecting other collections or communities by forcing an overwrite of those items aswell. This would also ensure that we dont run over other items that aren't up to date in the current folder/zip aswell. If you want to force this to happen you would have to provide -z all:r to tell it to recursively search for and restore those packages aswell. |
|
@ConfusionOrb221 : I tried adding the Here's the full command that I ran: As a sidenote, I'm having a difficult time understanding the expected behavior of the new For instance:
Personally, I'm also questioning why we are not restoring all missing relationships by default. It seems odd to default to only restoring relationships if the object they reference already exists. With AIPs in general, the policy in the past has been to restore all missing data by default. So, this seems like the first scenario were we are defaulting to not restoring something that existed before (i.e. the relationships). If it's possible to do so, I'd recommend we rethink this approach as I would expect all missing relationships to be restored when you restore data via AIPs. Though, this could also be solved if we respected the |
|
@tdonohue, we have taken the approach that it's important to keep a logical separation between the
I believe there is some ambiguity about the default behavior when the -z flag is not used and when it is present. I will try to clarify the behavior below Dissemination Dissemination always includes relationships in mets.xml
Ingest Ingest always attempts to restore "relationships" from the mets. If the target item does not exist, skip the relationship
Syntax considerations I know that the
If this becomes a release roadblock, we can exclude the recursion or path-traversal topic and address it in a later release after further refinement of the requirements. |
|
@mdiggory : While I appreciate the details, I'm still not able to get this PR to function as I expected it to function. Maybe I'm running the wrong commands, or maybe my assumptions are incorrect. But, I feel like I don't know how to get this PR to work properly when it comes to restoring an entire Community (of Entities and relationships) or an entire Site (again with all Entities and relationships). Both of these are basic features of AIP Backup and Restore, and we want to ensure they also work for Entities & relationships, obviously. As I documented in my comments above (see #3328 (review) and #3328 (comment)), I've been unsuccessful in my tests of restoring an entire Site or an entire Community and seeing the relationships also restored. It's possible I'm missing the right flags, but I've tried a number of combinations based on how it appears the Overall, I'm worried this PR is not yet stable enough for a release...it looks like it should be promising, but I cannot get it to function as it's described. Unfortunately, we are running very short on time for 10.0 (as code freeze is coming on May 22) and my time will be less available to continue to review & try to debug this PR in the coming weeks. I've yet to find another volunteer reviewer as well. So, if you or @ConfusionOrb221 are able to determine how to successfully restore the Entities & Relationships in the community export that I shared in this comment (or a similar Community export), that'd be beneficial to this PR moving forward for 10.0. While I understand there's a ton of new options with the Thanks! |
|
@tdonohue, I appreciate your feedback, and please don't take my response to mean we are not working to resolve the issues you are encountering on our side. We are scheduling to address them in the next several days. |
|
@mdiggory : Thanks for clarifying. From my perspective, I do also understand the usefulness of the In other words, I'd expect something like this:
That said, if you all think of a better approach, I'm open to other ideas. I just don't like having to specify recursion twice (e.g. |
Related issues
Description
This includes relationship information in METS AIPs created by the packager. They are stored in both directions, and there is a dedicated portion of METS, a new structMap, with id "RELS", for holding relationship information. Virtual metadata is also retained, but alongside concrete metadata as it was before.
I have provided an updated documentation to the lyrasis in a gist here: https://gist.github.com/ConfusionOrb221/c38be37c3c80307bf0ee306e36c87fcf
Usage
When this change is part of the DSpace code, exporting items with relationships will automatically be done. There are no separate options for export.
For imports, it is possible to specify a "scope", which describes the relationships to follow (to restore related items) for the restore operation. This is necessary because it will sometimes be required that two related items need to be restored. They cannot be restored independently because a) we assume the items must be fully restored in a single restore operation, and b) when an item effectively depends on another to exist (via a relationship to it), it must be part of the same restore operation.
Scope can be specified with -z, which defaults to '*', which means all relationships will be followed one level out. If any of the referred items do not exist in the repository, it attempts to restore them as long as they are directly related to the original item being restored (this is analogous to how the "recursive" option has historically worked when restoring collections).
Scope syntax is any number of the following, separated by commas:
Where:
REL_LABELis the relationship to follow (*means all), and recursive, if present, means to follow it forever (loops are detected and avoided if present).Testing
The easiest way to see the feature in action is to do a manual packager export/import of two related items in your repository, which aren't related to anything else (to keep it simple).
You should see the new structMap sections. You can remove the items from DSpace and then restore them.
Some ITs have been added to exercise this functionality.