Add support for archiving/restoring relationships via AIPs / packager #3328

Open
ConfusionOrb221 wants to merge 28 commits into DSpace:main from atmire:w2p-80200_Retain-Rels

Conversation

@ConfusionOrb221
Contributor

@ConfusionOrb221 ConfusionOrb221 commented Jul 19, 2021

Related issues

Description

This includes relationship information in METS AIPs created by the packager. They are stored in both directions, and there is a dedicated portion of METS, a new structMap, with id "RELS", for holding relationship information. Virtual metadata is also retained, but alongside concrete metadata as it was before.

I have provided updated documentation for Lyrasis in a gist: https://gist.github.com/ConfusionOrb221/c38be37c3c80307bf0ee306e36c87fcf

Usage

Once this change is part of the DSpace code, items with relationships are exported automatically. There are no separate options for export.

For imports, you can specify a "scope", which describes which relationships to follow (to restore related items) during the restore operation. This is necessary because two related items sometimes need to be restored together. They cannot be restored independently because a) we assume the items must be fully restored in a single restore operation, and b) when an item effectively depends on another to exist (via a relationship to it), that other item must be part of the same restore operation.

Scope is specified with -z. It defaults to '*', which means all relationships will be followed one level out. If any of the referred items do not exist in the repository, the packager attempts to restore them as long as they are directly related to the original item being restored (this is analogous to how the "recursive" option has historically worked when restoring collections).

Scope syntax is any number of the following, separated by commas:

REL_LABEL[:recursive]

Where: REL_LABEL is the relationship to follow (* means all), and recursive, if present, means to follow it forever (loops are detected and avoided if present).
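
As an illustration only (this is not the PR's Java code), the scope syntax above could be parsed roughly as follows in Python. ScopeEntry and parse_scope are hypothetical names. Accepting "all" as a synonym for "*" and ":r" as shorthand for ":recursive" is an assumption based on how the flag is used later in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScopeEntry:
    label: str        # relationship label, or "*" for every relationship
    recursive: bool   # True when the entry carries a recursion suffix


def parse_scope(scope: str = "*") -> List[ScopeEntry]:
    """Parse a comma-separated scope string such as 'relA,relB:recursive'."""
    entries = []
    for raw in scope.split(","):
        raw = raw.strip()
        if not raw:
            continue  # tolerate stray commas
        label, sep, suffix = raw.partition(":")
        if not label:
            raise ValueError(f"missing relationship label in {raw!r}")
        # Assumption: ":r" is accepted as shorthand for ":recursive".
        if sep and suffix not in ("recursive", "r"):
            raise ValueError(f"unknown scope suffix: {suffix!r}")
        # Assumption: "all" is treated as a synonym for "*".
        if label == "all":
            label = "*"
        entries.append(ScopeEntry(label=label, recursive=bool(sep)))
    return entries


if __name__ == "__main__":
    print(parse_scope("isPublicationOfOrgUnit,isAuthorOfPublication:r"))
```

Keeping the recursion marker per entry, rather than as one global switch, is what allows a scope to mix one-level and recursive relationships.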

Testing

The easiest way to see the feature in action is to do a manual packager export/import of two related items in your repository, which aren't related to anything else (to keep it simple).

You should see the new structMap sections. You can remove the items from DSpace and then restore them.
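
A quick way to confirm that an exported mets.xml carries the new section is to look for a structMap whose ID is "RELS". The Python sketch below checks only for the presence of that element; it makes no assumption about what the structMap contains. The has_rels_struct_map helper and the SAMPLE document are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"  # standard METS namespace


def has_rels_struct_map(mets_xml: str) -> bool:
    """Return True if the METS document has a structMap with ID 'RELS'."""
    root = ET.fromstring(mets_xml)
    for struct_map in root.iter(f"{{{METS_NS}}}structMap"):
        # The ID appears as both "RELS" and "rels" in this thread,
        # so the comparison is case-insensitive.
        if struct_map.get("ID", "").upper() == "RELS":
            return True
    return False


# Hypothetical, heavily trimmed METS document used only to exercise the check.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/">
  <structMap ID="struct_1" TYPE="LOGICAL"><div/></structMap>
  <structMap ID="RELS"><div/></structMap>
</mets>"""

if __name__ == "__main__":
    print(has_rels_struct_map(SAMPLE))
```

For a real export, the same function can be fed the text of the mets.xml file extracted from the AIP package.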

Some ITs have been added to exercise this functionality.

@cwilper cwilper added the work in progress PR is still being worked on & is not currently ready for review label Jul 19, 2021
@cwilper cwilper changed the title W2p 80200 retain rels Address issue #2882: Support for archiving/restoring relationships via packager Jul 19, 2021
@lgtm-com

lgtm-com Bot commented Jul 19, 2021

This pull request introduces 6 alerts when merging c87f85b into 94e18cb - view on LGTM.com

new alerts:

  • 4 for Dereferenced variable may be null
  • 1 for Useless null check
  • 1 for Potential input resource leak

@cwilper cwilper self-requested a review July 19, 2021 18:24
Contributor

@cwilper cwilper left a comment

Two general areas that should be addressed:

  • Non-functional (code formatting) changes in a few places...avoid these so review can focus on the functionality (but stay within checkstyle guidelines)
  • Need javadocs on any public methods (saw some missing on a new class, there may be others)

Test coverage could also be increased to cover "scope":

  • When provided, it should be respected
  • When not provided, it should obey the default behavior (which should be different from the test case for the above)

@lgtm-com

lgtm-com Bot commented Jul 19, 2021

This pull request fixes 1 alert when merging d1e7dfd into 94e18cb - view on LGTM.com

fixed alerts:

  • 1 for Potential input resource leak

@github-actions github-actions Bot added the merge conflict PR has a merge conflict that needs resolution label Oct 7, 2021
@github-actions github-actions Bot removed the merge conflict PR has a merge conflict that needs resolution label Feb 10, 2026
@tdonohue tdonohue added new feature interface: REST API v7+ REST API for v7 and later (dspace-server-webapp module) labels Feb 10, 2026
@mdiggory mdiggory self-requested a review March 31, 2026 16:13
@mdiggory mdiggory removed the status in DSpace 10.0 Release Mar 31, 2026
@mdiggory mdiggory moved this to 👍 Reviewer Approved in DSpace 10.0 Release Mar 31, 2026
@tdonohue
Member

tdonohue commented Apr 1, 2026

@mdiggory and @ConfusionOrb221 : Thanks for your continued work on this. While this PR has missed the "new feature" merger deadline for DSpace 10.0 (which was last Friday), I'm going to talk with developers/Committers in our Dev Mtg tomorrow about whether to recategorize this PR as a bug fix.

The reason for potentially doing so is that the AIP functionality has been flawed (since 7.x) because it doesn't work properly for Entities and their relationships. Therefore, it seems reasonable (to me) that, assuming this PR fixes the flaws, we might want to categorize this as a bug fix and potentially backport it to 9.x/8.x/7.x. However, that decision will require discussion in a DevMtg.

If this PR is recategorized as a "bug fix", then I'll update the labels and we'll work to find additional reviewers/testers to try to get it into 10.0 before the bug-fix deadline (May 22). Otherwise, unfortunately, this will miss the 10.0 release because it was not reviewed/merged based on our deadlines (as it requires two +1 votes to merge and it's only currently at +1).

I'll post updates after the discussion tomorrow.

@github-actions github-actions Bot added the merge conflict PR has a merge conflict that needs resolution label Apr 1, 2026
@github-actions

github-actions Bot commented Apr 1, 2026

Hi @ConfusionOrb221,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!

@github-actions github-actions Bot removed the merge conflict PR has a merge conflict that needs resolution label Apr 1, 2026
@tdonohue tdonohue self-requested a review April 2, 2026 14:33
@tdonohue tdonohue added bug and removed new feature labels Apr 2, 2026
@tdonohue
Member

tdonohue commented Apr 2, 2026

Per discussion in today's Developers Meeting, this has been recategorized as a "bug fix" because AIP Backup & Restore is currently broken for Configurable Entities. That means this PR can still be included in 10.0 (and potentially backported) provided that it gets additional testers/reviewers (as it requires two +1 votes) prior to the 10.0 bug fix merger deadline of May 22.

@mdiggory
Member

mdiggory commented Apr 9, 2026

@tdonohue, this is good news. Thank you. How can we promote this to other reviewers?

@tdonohue
Member

tdonohue commented Apr 9, 2026

@mdiggory : I've been trying to promote this in weekly developer meetings, but haven't found other volunteers yet. So, I'd recommend that you also try to promote it on Slack or Committers list or similar. It might even be possible to find more testers by providing some example ways to test this more easily (e.g. provide some example step by step instructions in the PR description, as the "Testing" section in the PR description is currently very vague.)

I'm also planning on testing it myself, but I've not yet been able to find time for it. This is of interest to me though as I'd like to be able to potentially restore all the demo entities data used on https://demo.dspace.org and https://sandbox.dspace.org via AIP Backup & Restore.

@tdonohue
Member

tdonohue commented Apr 21, 2026

@mdiggory and @ConfusionOrb221 : I'm trying to run some tests today with my local test instance. It has a lot of test Entities, mostly based on our Demo Entity Dataset. This is the same demo data that we use on https://sandbox.dspace.org and https://demo.dspace.org

While I'm able to successfully export to AIPs (and see the new relationships represented in the mets.xml), I'm having a lot of issues with re-importing content. There's error after error. I'm not even confident they are related to this PR, but they seem to imply that our AIP import is currently not functioning properly.

Here's what I've tried to do:

  1. Export all my data to AIPs
    [dspace]/bin/dspace packager -d -a -t AIP -e dspacedemo+admin@gmail.com -i 123456789/0 sitewide-aip.zip
    
  2. Spin up a fresh instance of DSpace in Docker with an empty database
  3. Create a single admin user in that new DSpace instance
  4. Attempt to reimport everything in "replace mode"
    [dspace]/bin/dspace packager -r -a -f -t AIP -e dspacedemo+admin@gmail.com sitewide-aip.zip
    

The import immediately fails, but it appears possibly unrelated to this PR? It fails with a CrosswalkInternalException saying that it cannot find one of my Administrative users defined in the mets.xml file in my sitewide AIP.

I currently cannot seem to find a way around this, but it's possible it's an issue with my local data or configuration (If I manage to figure out a way to fix my data, I'll let you know)

Have either of you successfully used this PR to export everything out of a DSpace site (especially one with our Demo Entities Dataset) and restore it into an empty site? In other words, has this PR been tested at the site-wide export/import level, or just for individual objects?

@ConfusionOrb221
Contributor Author

@tdonohue
I've only ever tested this against the database I exported from, never a completely fresh database. I would delete items and relationships locally and then try to import them again, not import into a fresh one. Although, like you said, I can't imagine my PR code is causing the issue you described, as it doesn't touch the way it would interact with user groups.

@tdonohue
Member

@ConfusionOrb221 : I understand. My challenge is that I want to give this a thorough test by doing a full restoration on an empty database. Currently, I cannot do that....but, I'm not convinced yet that it's the fault of this PR. There almost seems to be a lot of "fragile" code in the AIP Backup and Restore that seems to make this complete restoration difficult.

This full restoration process used to work. It's just no longer working for me and I'm wrestling with it so that I can thoroughly test this PR.

Member

@tdonohue tdonohue left a comment

@ConfusionOrb221 and @mdiggory : I've finally been able to get basic site-wide export and import of AIPs working again (though I needed to fix several small bugs in AIP backup & restore as detailed in #12343).

Because of that, I've been able to get back to testing this PR (with my #12343 in place as well).

So far, I've not had luck in restoring relationships between AIPs even though the AIPs generated do all have the <structMap ID="rels"> element in the METS which stores the relationships.

I have this same issue if I'm restoring an entire site, or just a single Community (with a number of related Entities).

Here's an example AIP export of just a single Community (which corresponds to this community on our Sandbox)

demo-dcat-journals.tar.gz

I'm restoring this content by doing this:

tar -xvzf demo-dcat-journals.tar.gz
# That puts everything in a `tmp` directory
cd tmp
# Now, import it all (starting with the community) as a user named "test@test.edu"
[dspace]/bin/dspace packager -r -a -f -t AIP -e test@test.edu -o skipIfParentMissing=true COMMUNITY\@123456789-1119.zip

The result I see is that everything imports properly with the proper entity type. However, none of the relationships are restored. Any ideas? Could you see if this small set of data works for you?

@ConfusionOrb221
Contributor Author

ConfusionOrb221 commented Apr 27, 2026

@tdonohue I'll have to check this myself, but part of the default -z operation that would run here (as none was provided) would cause it to only create rels for items that currently exist in the repository. This is because we didn't want a runaway effect where you ingest an item and it starts creating thousands of related items (items related to the item you are ingesting, then items related to those ingested relationships, etc.), or where you ingest a community and end up affecting other collections or communities by forcing an overwrite of those items as well. This also ensures that we don't run over other items that aren't up to date in the current folder/zip. If you want to force this to happen, you would have to provide -z all:r to tell it to recursively search for and restore those packages as well.

@tdonohue
Member

@ConfusionOrb221 : I tried adding the -z all:r parameter that you suggested, and I ended up with the same results. The Entities themselves are restored, but no relationships are recreated.

Here's the full command that I ran:

[dspace]/bin/dspace packager -r -a -f -t AIP -e test@test.edu -z all:r -o skipIfParentMissing=true COMMUNITY\@123456789-1119.zip

As a sidenote, I'm having a difficult time understanding the expected behavior of the new -z parameter. I think your documentation needs more examples. For instance, it doesn't even mention usage of -z all:r and how it would behave differently from the default value of -z all. I think we need to be clear which AIP restore scenarios may require different flags.

For instance:

  1. It sounds like whenever you are restoring an entire site, you really MUST always add -z all:r or else the restoration would never include restoring relationships (which would be unexpected behavior if you are trying to restore everything)
  2. I'm also not entirely sure I understand why relationships need the :r setting to flag recursive mode. Shouldn't we just use the -a flag that already exists to decide whether to recursively add more objects or not? If -a isn't passed in, then -z all should default to just restoring relationships in a non-recursive manner. But, if -a is passed in, then it should restore relationships in a recursive manner. (Is there a reason why the -a flag isn't sufficient here, since it already is meant to represent "recursive" mode?)

Personally, I'm also questioning why we are not restoring all missing relationships by default. It seems odd to default to only restoring relationships if the object they reference already exists. With AIPs in general, the policy in the past has been to restore all missing data by default. So, this seems like the first scenario where we are defaulting to not restoring something that existed before (i.e. the relationships).

If it's possible to do so, I'd recommend we rethink this approach as I would expect all missing relationships to be restored when you restore data via AIPs. Though, this could also be solved if we respected the -a flag for relationships...when that flag is passed, we'd restore recursively...when it's not passed in, we'd just restore relationships to existing objects.

@mdiggory
Member

mdiggory commented May 7, 2026

@tdonohue, we have taken the approach that it's important to keep a logical separation between the -r -a functionality and the -z functionality. This ensures that the original CLI and capability stay backward compatible:

  • reserving the -r -a for site/community/collection/item traversal
  • reserving the -z functionality for traversing relationships

I believe there is some ambiguity about the default behavior when the -z flag is not used and when it is present. I will try to clarify the behavior below

Dissemination

Dissemination always includes relationships in mets.xml

  • -z or -z all: traverses all immediately related items and disseminates them too (not their related items)
  • -z isPublicationOfOrgUnit: source item is an OrgUnit; traverses only isPublicationOfOrgUnit and disseminates the Publications as well.
  • -z isPublicationOfOrgUnit,isAuthorOfPublication:r: source item is an OrgUnit; traverses only isPublicationOfOrgUnit and disseminates the Publications, then traverses isAuthorOfPublication to disseminate all the Authors.
  • -z all:r: we more than likely don't want to have this as an option, as it suggests we are disseminating the entire repository, and there are capabilities already present in -r -a to achieve that.

Ingest

Ingest always attempts to restore "relationships" from the mets. If the target item does not exist, skip the relationship

  • -z or -z all: attempt to restore all target items of the current item, ensuring all relationships in the original item are restored (no recursion into target items' relationships)
  • -z isPublicationOfOrgUnit: source item is an OrgUnit, traverses only isPublicationOfOrgUnit and ingests those Publications as well
  • -z isPublicationOfOrgUnit,isAuthorOfPublication:r: source item is an OrgUnit, this traverses only isPublicationOfOrgUnit and ingests those Publications, it then traverses isAuthorOfPublication to ingest all the Authors.
  • -z all:r: we more than likely don't want to have this as an option, as it suggests we are ingesting the entire repository, and there are capabilities already present in -r -a to achieve that.
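
To make the traversal rules above concrete, here is a small Python sketch of one possible reading of them. It is not the PR's implementation. The interpretation is an assumption: scope entries without :r are followed only from the item being restored, while entries with :r are followed from every item reached. The graph model, the function names, and the item ids are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# item id -> list of (relationship label, related item id); hypothetical model
Graph = Dict[str, List[Tuple[str, str]]]
Scope = List[Tuple[str, bool]]  # (relationship label, recursive flag)


def matches(label: str, scope_label: str) -> bool:
    """A scope label of '*' or 'all' matches every relationship label."""
    return scope_label in ("*", "all") or scope_label == label


def items_in_scope(graph: Graph, root: str, scope: Scope) -> Set[str]:
    """Collect the root plus every related item the scope says to follow."""
    visited = {root}               # doubles as loop protection
    queue = deque([(root, True)])  # (item, is this the root item?)
    while queue:
        item, is_root = queue.popleft()
        for label, target in graph.get(item, []):
            # Assumed rule: non-recursive entries apply at the root only.
            followed = any(
                matches(label, scope_label) and (is_root or recursive)
                for scope_label, recursive in scope
            )
            if followed and target not in visited:
                visited.add(target)
                queue.append((target, False))
    return visited


# Hypothetical example data: one OrgUnit, two Publications, their Authors.
EXAMPLE: Graph = {
    "orgunit": [("isPublicationOfOrgUnit", "pub1"),
                ("isPublicationOfOrgUnit", "pub2")],
    "pub1": [("isAuthorOfPublication", "author1"),
             ("isOrgUnitOfPublication", "orgunit")],
    "pub2": [("isAuthorOfPublication", "author2"),
             ("isProjectOfPublication", "project1")],
    "author1": [("isPublicationOfAuthor", "pub1"),
                ("isPublicationOfAuthor", "pub3")],
}

if __name__ == "__main__":
    scope = [("isPublicationOfOrgUnit", False), ("isAuthorOfPublication", True)]
    print(sorted(items_in_scope(EXAMPLE, "orgunit", scope)))
```

Under this reading, all:r reaches everything connected to the starting item, which is the behavior the last bullet warns about.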

Syntax considerations

I know that the :r syntax can be ambiguous regarding which targets should be considered; this is why we have a problem with all:r meaning everything in the repo. There may be other graph traversal syntaxes that are more capable; using / instead of :r may be more intuitive and easier to manage. For example:

-z isPublicationOfOrgUnit/isAuthorOfPublication: for identified orgUnit, restore all publications and their authors
-z isPublicationOfOrgUnit/all: for identified orgUnit, restore all publications and all their related items

If this becomes a release roadblock, we can exclude the recursion or path-traversal topic and address it in a later release after further refinement of the requirements.

@tdonohue
Member

tdonohue commented May 7, 2026

@mdiggory : While I appreciate the details, I'm still not able to get this PR to function as I expected it to function. Maybe I'm running the wrong commands, or maybe my assumptions are incorrect. But, I feel like I don't know how to get this PR to work properly when it comes to restoring an entire Community (of Entities and relationships) or an entire Site (again with all Entities and relationships). Both of these are basic features of AIP Backup and Restore, and we want to ensure they also work for Entities & relationships, obviously.

As I documented in my comments above (see #3328 (review) and #3328 (comment)), I've been unsuccessful in my tests of restoring an entire Site or an entire Community and seeing the relationships also restored. It's possible I'm missing the right flags, but I've tried a number of combinations based on how it appears the -z flag should work, and they've all been unsuccessful so far.

Overall, I'm worried this PR is not yet stable enough for a release...it looks like it should be promising, but I cannot get it to function as it's described. Unfortunately, we are running very short on time for 10.0 (as code freeze is coming on May 22) and my time will be less available to continue to review & try to debug this PR in the coming weeks. I've yet to find another volunteer reviewer as well.

So, if you or @ConfusionOrb221 are able to determine how to successfully restore the Entities & Relationships in the community export that I shared in this comment (or a similar Community export), that'd be beneficial to this PR moving forward for 10.0.

While I understand there's a ton of new options with the -z flag, I'm finding I don't have clear instructions for how to perform a full Site or full Community export and reimport such that relationships are also restored. So, I'd appreciate help in drafting example commands for various export and import scenarios like full Site, single Community or single Collection restoration, along with the single Entity/Item restoration.

Thanks!

@mdiggory
Member

mdiggory commented May 8, 2026

@tdonohue, I appreciate your feedback, and please don't take my response to mean we are not working to resolve the issues you are encountering on our side. We are scheduling to address them in the next several days.

@tdonohue
Member

tdonohue commented May 8, 2026

@mdiggory : Thanks for clarifying.

From my perspective, I do also understand the usefulness of the -z param. My only concern was specifically with the :r portion of that command, as it seems (to me at least) like we could use -a instead (as that always runs recursively).

In other words, I'd expect something like this:

  • -r -a -z => would recursively restore the passed in object along with all relationships and related/referenced objects (as -z defaults to -z all).
  • -r -a -z isPublicationOfOrgUnit => Assuming source is an OrgUnit, recursively restore all objects but only following & restoring that isPublicationOfOrgUnit relationship.
  • -r -z => would only restore the existing object (no recursion) but would attempt to restore any relationships to other existing Entities (non-existing entities would not be restored, nor their relationships, as this is not recursive)
  • -r -z isPublicationOfOrgUnit => Assuming source is an OrgUnit, restore this current object, but ONLY restore isPublicationOfOrgUnit relationships to any existing Publication Entities (non-existing entities would not be restored, nor their relationships, as this is not recursive).
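
The four combinations above can be modeled with a short Python argparse sketch. This only models the proposed flag semantics; the real packager is a Java command-line tool, and build_parser, describe, and the restore_missing_related key are hypothetical names.

```python
import argparse
from typing import Dict, List, Optional, Union


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="packager-sketch")
    parser.add_argument("-r", dest="restore", action="store_true")
    parser.add_argument("-a", dest="recursive", action="store_true")
    # nargs="?" with const="all" makes a bare -z behave like -z all.
    parser.add_argument("-z", dest="scope", nargs="?", const="all", default=None)
    return parser


def describe(argv: List[str]) -> Dict[str, Union[Optional[str], bool]]:
    """Summarize how the proposal would treat one set of flags."""
    args = build_parser().parse_args(argv)
    return {
        "scope": args.scope,
        # Proposed rule: -a alone decides whether related objects that do
        # not exist yet are restored too. A run without -z is not covered
        # by the proposal, so this sketch simply reports False for it.
        "restore_missing_related": args.scope is not None and args.recursive,
    }


if __name__ == "__main__":
    for flags in (["-r", "-a", "-z"], ["-r", "-z", "isPublicationOfOrgUnit"]):
        print(flags, describe(flags))
```

With this shape, recursion is stated once through -a, and -z only selects which relationships are considered.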

That said, if you all think of a better approach, I'm open to other ideas. I just don't like having to specify recursion twice (e.g. -r -a -z all:r looks odd because we already said this restore is recursive via the -a flag, so why do we need to also have :r?)

Labels

bug component: configurable entities Related to Configurable Entities feature high priority interface: REST API v7+ REST API for v7 and later (dspace-server-webapp module) tools: packager Related to package or AIP importer/exporter

Projects

Status: 👍 Reviewer Approved

Development

Successfully merging this pull request may close these issues.

Export/import Relationships via AIP Backup & Restore feature

5 participants