Harvesting: OAI sets are not updated when datasets are deleted. #8005

Closed
kcondon opened this issue Jul 20, 2021 · 5 comments
Labels
  • Feature: Harvesting
  • NIH OTA DC Grant: The Harvard Dataverse repository: A generalist repository integrated with a Data Commons
  • NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ...
  • pm.epic.nih_harvesting
  • pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues
  • pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards
  • pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues
  • Size: 10 A percentage of a sprint. 7 hours.

Comments

@kcondon
Contributor

kcondon commented Jul 20, 2021

The OAI protocol includes the concept of updating the list of identifiers in a set when something is deleted. Currently, in 5.5 on demo, a harvest of the "no set" set, i.e. all datasets, correctly harvests the 133 published datasets but fails on 2068 others that no longer exist in the db. We believe this is a result of our monthly automatic cleanup via the API (destroy). It seems the cleanup is not updating the OAI sets, leading to phantom harvest records. Need to verify whether the problem is server side or client side.

Update: confirmed that the oairecord for the failing globalid was marked as removed in the db on the server, so either it is not passed as deleted by OAI or, more likely, the client does not handle the removed status correctly.
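
One way to tell the two hypotheses apart is to look at the record header the server actually returns for the failing identifier. The following is a minimal sketch, not code from Dataverse: it parses an OAI-PMH response and reports whether a given identifier's header carries status="deleted". The function name, the sample XML, and the identifiers in it are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 schema.
OAI_URI = "http://www.openarchives.org/OAI/2.0/"
OAI_NS = {"oai": OAI_URI}


def header_status(response_xml, identifier):
    """Return 'deleted', 'active', or None if the identifier is not listed.

    Hypothetical helper: works on any OAI-PMH response that contains
    <header> elements (ListIdentifiers, ListRecords, GetRecord).
    """
    root = ET.fromstring(response_xml)
    for header in root.iter("{%s}header" % OAI_URI):
        ident = header.findtext("oai:identifier", namespaces=OAI_NS)
        if ident == identifier:
            # A deleted record is flagged with status="deleted" on its header.
            return "deleted" if header.get("status") == "deleted" else "active"
    return None


# Hand-written sample response; the identifiers are placeholders.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header status="deleted">
      <identifier>doi:10.5072/FK2/EXAMPLE1</identifier>
      <datestamp>2021-07-01T00:00:00Z</datestamp>
    </header>
    <header>
      <identifier>doi:10.5072/FK2/EXAMPLE2</identifier>
      <datestamp>2021-07-02T00:00:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(header_status(SAMPLE, "doi:10.5072/FK2/EXAMPLE1"))
```

If the server reports the failing identifier as deleted, the remaining suspect is the client's handling of that status.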

@mreekie mreekie added pm.epic.nih_harvesting NIH OTA DC Grant: The Harvard Dataverse repository: A generalist repository integrated with a Data Commons labels May 9, 2022
@mreekie mreekie added NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... and removed NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... labels Oct 25, 2022
@mreekie mreekie removed the NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... label Nov 2, 2022
@mreekie mreekie added the NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... label Dec 5, 2022
@mreekie

mreekie commented Jan 9, 2023

Review with Leonid

  • good candidate
  • Get this estimated and prioritized

@mreekie mreekie added pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards labels Mar 20, 2023
@cmbz cmbz added the pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues label Jun 2, 2023
@cmbz

cmbz commented Dec 19, 2023

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

@landreev
Contributor

This is an old (3 y.o.) issue. Also, note that the update in the opening comment suggests the problem may have been a fluke on the client side all along.
The OAI implementation has been updated since the issue was opened (by switching to the new-and-much-improved xoai library).
So, in short, there's a very good chance that this is no longer a problem and the issue can be closed. But I'm going to give it a 10 just in case.

@landreev landreev added the Size: 10 A percentage of a sprint. 7 hours. label Dec 19, 2023
@landreev landreev self-assigned this Jan 4, 2024
@landreev
Contributor

landreev commented Jan 5, 2024

What I said in the last comment appears to be correct: this is no longer an issue.
This can be confirmed by harvesting from demo:
oai url: https://demo.dataverse.org/oai
oai set: none (i.e., the default, "everything" set)
format: oai_dc

The main set from demo is a perfect test for this, because we continuously purge most of the datasets created there, as described in the opening comment. Per OAI-PMH specifications, the OAI records for these datasets are kept, so that we can communicate to any clients that may have harvested them that they no longer exist and that they need to remove them on their end as well.
Harvesting the above should (as of this writing) result in ~2,600 successfully harvested records and a few (a single digit number of) failures (on account of some invalid metadata - this is demo.dataverse.org after all). This matches the number of published and successfully exported datasets that currently exist there. Looking at the output of oai?verb=ListIdentifiers&metadataPrefix=oai_dc you can see that ~3,000 more records are listed, all marked with status="deleted". Our (Dataverse) OAI client handles these properly as well: it skips a deleted record if it has no local record with that DOI, and deletes the local record if it does.
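
The client-side rule described above (skip a deleted record that was never harvested, delete one that was) can be sketched as follows. This is an illustration only and not the actual Dataverse harvesting client code; the function name, the (identifier, deleted) tuple shape, and the sample identifiers are assumptions.

```python
def plan_actions(headers, local_ids):
    """Decide what a harvesting client should do with each listed header.

    headers:   iterable of (identifier, deleted) pairs, e.g. parsed from a
               ListIdentifiers response (hypothetical shape).
    local_ids: set of identifiers already harvested on the client side.
    """
    actions = {"harvest": [], "delete": [], "skip": []}
    for identifier, deleted in headers:
        if not deleted:
            # Live record: fetch (or refresh) it.
            actions["harvest"].append(identifier)
        elif identifier in local_ids:
            # Deleted on the server and present locally: remove it.
            actions["delete"].append(identifier)
        else:
            # Deleted on the server and never harvested: nothing to do.
            actions["skip"].append(identifier)
    return actions


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    listed = [
        ("doi:10.5072/FK2/AAA", False),
        ("doi:10.5072/FK2/BBB", True),
        ("doi:10.5072/FK2/CCC", True),
    ]
    already_harvested = {"doi:10.5072/FK2/BBB"}
    for action, ids in plan_actions(listed, already_harvested).items():
        print(action, ids)
```

Under this rule a deleted header never counts as a harvest failure, which is consistent with the near-zero failure count reported for demo.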

@landreev landreev closed this as completed Jan 5, 2024
@landreev
Contributor

landreev commented Jan 5, 2024

P.S.

... and a few (a single digit number of) failures (on account of some invalid metadata

Actually, scratch that - all the oai_dc records on demo are valid.
