Purge all the stale Nesstar harvested dataverses in production #153

Closed
landreev opened this issue Apr 20, 2022 · 9 comments
Assignees
Labels
Feature: Harvesting NIH GREI General work related to any of the NIH GREI aims Size: 10 A percentage of a sprint.

Comments

@landreev
Collaborator

Short version: This is very stale content that we have no means to refresh or to serve meaningfully. All these harvested objects do is pad our dataset counts. But they are more of an embarrassment than they are worth by now, IMO.

History: we haven't supported harvesting from Nesstar repositories since v. 4 (!!!). The Nesstar-based harvesting clients and the corresponding dataverses and harvested datasets we have in production were grandfathered from DVN v3 via database migration. That was done on the assumption that we would add Nesstar support or otherwise revisit the issue sometime soon (sigh). Nesstar as a system has been completely reimplemented since then, so if we want to harvest content from these repositories in the future it would need to be reimplemented from scratch on our end. A lot of this content is completely stale by now.

There is some evidence that these harvesting clients and the corresponding dataverses cannot be removed using the normal client manager: #142. So purging them may require some manual API and/or database work (just like creating them did early on).

@landreev landreev changed the title Purge all the stale Nesstar harveted dataverses in production Purge all the stale Nesstar harvested dataverses in production Apr 20, 2022
@jggautier
Collaborator

jggautier commented Aug 8, 2022

Laura Huis in 't Veld from DANS-KNAW asked today if we could remove the records harvested at https://dataverse.harvard.edu/dataverse/dans, which were harvested from a Nesstar repository. The email is at https://help.hmdc.harvard.edu/Ticket/Display.html?id=324230. I let them know that we'll work on it.

On the Harvard repository's Manage Clients page, there's no longer an entry for this client.

@cmbz cmbz added pm.GREI-d-2.4.1 NIH, yr2, aim4, task1: Implement packaging standards based on working group feedback NIH GREI General work related to any of the NIH GREI aims Feature: Harvesting and removed pm.GREI-d-2.4.1 NIH, yr2, aim4, task1: Implement packaging standards based on working group feedback labels Dec 18, 2023
@cmbz
Collaborator

cmbz commented Dec 19, 2023

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

@cmbz cmbz added the Size: 10 A percentage of a sprint. label Dec 19, 2023
@cmbz
Collaborator

cmbz commented Dec 19, 2023

2023/12/19: Sized at 10 during sizing meeting.

@landreev landreev self-assigned this Jan 17, 2024
@landreev
Collaborator Author

As far as users seeing these records in search results is concerned, they have all been "purged" already, in that they were all dropped from the Solr index when 6.0 was deployed. But they were still sitting in the database.

They are being deleted now. It's just going to take some time, since there doesn't appear to be any better or faster way than calling the Destroy API on them one by one.
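For readers unfamiliar with the one-by-one approach, here is a minimal sketch of what such a loop could look like. It is not the script that was actually run. It assumes the Dataverse native API's superuser-only `DELETE /api/datasets/{id}/destroy` endpoint and the `X-Dataverse-key` header; the function names, the server URL, and the injectable `send` parameter are hypothetical and exist only for illustration.

```python
import urllib.request
from typing import Callable, Iterable, List


def build_destroy_request(server_url: str, dataset_id: int, api_token: str) -> urllib.request.Request:
    """Build a DELETE request for the destroy endpoint of one dataset.

    Assumption: the endpoint path and the X-Dataverse-key header follow the
    Dataverse native API; the token must belong to a superuser.
    """
    url = f"{server_url.rstrip('/')}/api/datasets/{dataset_id}/destroy"
    return urllib.request.Request(
        url, method="DELETE", headers={"X-Dataverse-key": api_token}
    )


def destroy_datasets(
    server_url: str,
    dataset_ids: Iterable[int],
    api_token: str,
    send: Callable = urllib.request.urlopen,  # hypothetical hook so the loop can be tested offline
) -> List[int]:
    """Destroy datasets one at a time and return the ids that failed."""
    failed = []
    for dataset_id in dataset_ids:
        request = build_destroy_request(server_url, dataset_id, api_token)
        try:
            with send(request) as response:
                if response.status != 200:
                    failed.append(dataset_id)
        except OSError:
            # urllib's URLError and HTTPError are subclasses of OSError.
            failed.append(dataset_id)
    return failed
```

Collecting the failed ids rather than stopping at the first error lets a long overnight run continue and be retried afterwards for just the leftovers.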

@landreev
Collaborator Author

Confirming that all the nesstar-harvested records were deleted overnight.
I could just drag this to "Done", or "Merged" as it's called now, since it's not something that can be reviewed or QA-ed in our normal sense, and there's nothing to merge, of course.
Just verifying that the datasets are no longer in the database could be both a review and QA though. @jggautier you are likely the only person on the team who knows what I'm even talking about, in case you want to take a look and confirm.
If you check the copy of the prod. db, there are still 11 nesstar sources there:

SELECT name FROM harvestingclient WHERE harvesttype='nesstar';
adpljubljana
adpss
cora
dans
ddadenmark
esds
nsdnorway
odesi
SND
statcan
uofcnesstar

... and several thousand harvested datasets associated with them, as in:
SELECT COUNT(d.id) FROM harvestingclient c, dataset d WHERE c.harvesttype='nesstar' AND d.harvestingclient_id = c.id;

These have all been removed from the actual production database. So as of next Monday, when the db copy is updated, both numbers above will be zero.
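If a per-source breakdown is more useful for the check than a single total, the two queries above can be combined. The following is only a sketch: it uses an in-memory SQLite database with made-up rows as a stand-in for the production PostgreSQL copy, and keeps just the table and column names that appear in the queries above. The function names are hypothetical.

```python
import sqlite3

# Remaining harvested datasets per nesstar client; the LEFT JOIN keeps
# clients that have no datasets left, so they show up with a count of 0.
PER_CLIENT_COUNT_SQL = """
SELECT c.name, COUNT(d.id) AS remaining
FROM harvestingclient c
LEFT JOIN dataset d ON d.harvestingclient_id = c.id
WHERE c.harvesttype = 'nesstar'
GROUP BY c.name
ORDER BY c.name
"""


def remaining_nesstar_datasets(conn):
    """Return (client name, remaining dataset count) rows."""
    return conn.execute(PER_CLIENT_COUNT_SQL).fetchall()


def demo():
    # Assumption: toy schema and rows, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE harvestingclient (id INTEGER PRIMARY KEY, name TEXT, harvesttype TEXT);
        CREATE TABLE dataset (id INTEGER PRIMARY KEY, harvestingclient_id INTEGER);
        INSERT INTO harvestingclient VALUES (1, 'dans', 'nesstar'), (2, 'SND', 'nesstar'), (3, 'other', 'oai');
        INSERT INTO dataset VALUES (10, 1), (11, 1), (12, 3);
        """
    )
    return remaining_nesstar_datasets(conn)
```

Once the clients themselves are gone, the query returns no rows at all, which matches the expectation that both numbers drop to zero.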

@landreev landreev removed their assignment Jan 19, 2024
@jggautier
Collaborator

Thanks @landreev. I'll let Laura Huis in 't Veld from DANS-KNAW know, in our email thread at https://help.hmdc.harvard.edu/Ticket/Display.html?id=324230, that the harvested datasets were removed and that by the end of next week I'll delete the collection at https://dataverse.harvard.edu/dataverse/dans.

And I'll let them know to reach out to us again if they'd like us to harvest the metadata, wherever it exists now. For example, one of these purged harvested datasets used to be at https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:33366, and it's now accessible at https://doi.org/10.17026/dans-xtu-d36b, which is at the DANS Data Station for Social Sciences and Humanities. So maybe they'd like us to harvest the datasets from that installation.

@landreev
Collaborator Author

Yeah, if any of these metadata can be re-harvested from up-to-date sources, we can/should do that.

@jggautier
Collaborator

@landreev, I'm not able to delete the collection at https://dataverse.harvard.edu/dataverse/SND, which had the harvested datasets that have been removed. When I try, the UI shows the error message:
"Error – This dataverse was not able to be deleted. If you believe this is an error, please contact Harvard Dataverse Support for assistance."

Do you think there's still something in the database that's preventing that collection from being deleted?

@landreev
Collaborator Author

@jggautier Yes, it looks like there was still some junk in the database (old legacy stuff) referencing these nesstar collections, which prevented them from being deleted. I cleaned it up and was able to delete SND. Please let me know if you run into anything similar with the other collections.
