Unable to create harvesting set of datasets in "Original Murray Collection" #124

Closed
jggautier opened this issue Sep 27, 2021 · 4 comments · Fixed by IQSS/dataverse#8197
Labels: bug (Something isn't working)

jggautier (Collaborator) commented Sep 27, 2021

We need to create a harvesting set containing "original" Murray datasets (this discussion and work is tracked in #68). Datasets in the Murray Research Archive Dataverse were reorganized so that the "original" datasets are within a newly created Dataverse collection (or within its subcollections) called Original Murray Collection (https://dataverse.harvard.edu/dataverse/originalMRA), which is within the Murray Research Archive Dataverse.

When I try to create a harvesting set of the datasets in the newer Original Murray Collection, the "Create Harvesting Set" popup tells me that the search query returned no results:

[Screenshot: the "Create Harvesting Set" popup reporting that the search query returned no results]

5112855 is the database ID of the Original Murray Collection.

And when I created the set, no datasets were in it.

The search page and the Search API return the 339 datasets that are within the Original Murray Collection and that we need to be in a harvesting set.

Some troubleshooting

Harvard Dataverse Repository and Demo Dataverse were both on v5.6 when I tried creating the harvesting set and did the following troubleshooting.

To see if the issue was related to all of the datasets being moved into the collection (as opposed to being created in the collection), on Demo Dataverse I created a new Dataverse collection, moved already-published datasets into that new collection, and tried to create a harvesting set. Demo Dataverse told me that the search query returned no results.

To see if the issue might instead or also be related to trying to create a harvesting set of datasets contained in a relatively new collection (created past a certain date), I found a collection on Harvard Dataverse Repository published today with a published dataset (also created today), and tried to create a harvesting set using that "subtree" query to include datasets in that collection. The "Create Harvesting Set" popup told me that it found one dataset.

I didn't want to try isolating the issue further by moving or creating new collections or datasets on the Harvard Dataverse Repository (as opposed to Demo Dataverse) because of the extra work involved in notifying people about the testing and destroying datasets or moving others' datasets back to their original location.

But hopefully this is helpful for more investigation into what's not allowing me to create a harvesting set that contains the datasets in the Original Murray Collection (https://dataverse.harvard.edu/dataverse/originalMRA).

jggautier added the bug label on Sep 27, 2021
djbrooke (Contributor) commented Oct 27, 2021

Check whether subtreePaths has to include the database IDs of all parent collections of the collection whose datasets need to be in the harvesting set (excluding the "Root" Dataverse collection).

That is, the query for creating a harvesting set containing datasets in the Original MRA Collection should be subtreePaths:"/10/5112855", since the ID of the parent MRA Dataverse collection is 10. If that's how it should work, update the guide (@djbrooke and @jggautier).
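To make the format of that query concrete, here is a minimal Python sketch that assembles it from a list of database IDs ordered from the outermost parent down to the target collection. The helper name `build_subtree_query` is made up for illustration and is not part of Dataverse; the only facts taken from this thread are the two IDs and the resulting query string.

```python
def build_subtree_query(collection_ids):
    """Return a subtreePaths query for IDs ordered parent-first.

    Hypothetical helper for illustration. The "Root" collection's ID
    is deliberately left out of the list, as described above.
    """
    if not collection_ids:
        raise ValueError("at least one collection ID is required")
    # int() rejects anything that is not a whole-number database ID.
    path = "/" + "/".join(str(int(cid)) for cid in collection_ids)
    return f'subtreePaths:"{path}"'


# MRA Dataverse collection (ID 10) -> Original Murray Collection (ID 5112855)
print(build_subtree_query([10, 5112855]))  # prints subtreePaths:"/10/5112855"
```

The printed string is what would be pasted into the query field of the "Create Harvesting Set" popup.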

jggautier (Collaborator, Author) commented Oct 27, 2021

subtreePaths:"/10/5112855" worked. https://dataverse.harvard.edu/oai?verb=ListRecords&set=Original_MRA_Collection&metadataPrefix=oai_dc
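For anyone who wants to check a set the same way, the sketch below rebuilds the ListRecords URL above and counts the records in a single OAI-PMH response. It uses only the Python standard library. The function names and the sample XML are made up for illustration, and the count covers one response page only: a large set may be split across several pages through a resumption token, which this sketch does not follow.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for one harvesting set."""
    params = {
        "verb": "ListRecords",
        "set": set_spec,
        "metadataPrefix": metadata_prefix,
    }
    return f"{base_url}?{urlencode(params)}"


def count_records(xml_text):
    """Count the <record> elements in one ListRecords response page."""
    root = ET.fromstring(xml_text)
    return len(root.findall("./oai:ListRecords/oai:record", OAI_NS))


# Made-up response body, trimmed to the elements the counter looks at.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>example-1</identifier></header></record>
    <record><header><identifier>example-2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://dataverse.harvard.edu/oai", "Original_MRA_Collection"))
print(count_records(SAMPLE))  # prints 2
```

A response without a ListRecords element, such as an OAI-PMH error response, is counted as 0 records.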

So subtreePaths has to include the database IDs of each of the collection's parent collections (excluding the "Root" Dataverse collection). I suppose it's not really a "path" if it doesn't include a kind of breadcrumb to the collection whose datasets need to be in the harvesting set. So a user could infer from the word "path" in "subtreePaths" that it must include the database IDs of the collection's parent collections.

We could include this explanation in the guides.

In the v5.7 Admin Guide, here's the part of the "Managing Harvesting Server and Sets" page that describes subtreePaths:

[Screenshot: the description of subtreePaths on the "Managing Harvesting Server and Sets" page of the v5.7 Admin Guide]

I'm also wondering why the system couldn't figure out the path on its own, given the ID of the collection.
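As a rough illustration of what "figuring out the path" would involve, the Python sketch below walks from a collection up to the root and joins the IDs it passes. It is not how Dataverse does it. The `parent_of` mapping is a hypothetical in-memory stand-in for whatever lookup the system has, and both the root ID of 1 and the MRA Dataverse sitting directly under the root are assumptions made only so the example runs.

```python
def subtree_path(collection_id, parent_of, root_id):
    """Return the path to collection_id, leaving out the root collection.

    parent_of maps each collection's database ID to its parent's ID
    (a hypothetical stand-in for a real lookup).
    """
    ids = []
    seen = set()
    current = collection_id
    while current != root_id:
        if current in seen:
            raise ValueError(f"cycle detected at collection {current}")
        seen.add(current)
        ids.append(current)
        current = parent_of[current]  # KeyError if a parent is unknown
    # IDs were collected child-first, so reverse them for the path.
    return "/" + "/".join(str(i) for i in reversed(ids))


# Assumed hierarchy: Root (1) -> MRA Dataverse (10) -> Original Murray Collection
PARENT_OF = {5112855: 10, 10: 1}

print(subtree_path(5112855, PARENT_OF, root_id=1))  # prints /10/5112855
```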

djbrooke (Contributor) commented

@jggautier - great that it works! I'll make a PR with a change that describes this in more detail - I didn't know about this either.

pdurbin (Member) commented Oct 27, 2021

> I'm also wondering why the system couldn't figure out the path on its own, given the ID of the collection.

I just thought I'd pipe in and say the system does figure it out for the Search API and you can pass the alias of the dataverse collection. The logic is all in Search.java and is only used by the Search API but it could be centralized. The Search API ultimately uses subtreePaths under the covers but it's pretty low-level. As you can see, you have to put database IDs in it. I think a longer term fix would be for an issue with a title something like "For harvesting, deprecate subtreePaths and introduce 'subtree' variable like the Search API". That is, make harvesting as easy as the Search API when it comes to creating the query. Centralize the logic. Stop using the low-level subtreePaths in harvesting. I hope that makes sense.
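For comparison, here is a small Python sketch of the Search API request described above, where the collection is named by its alias and no database IDs are needed. The function name `search_url` is made up for illustration, and the alias `originalMRA` is assumed from the collection URL earlier in this thread. The sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode


def search_url(base_url, alias, query="*", per_page=10):
    """Build a Search API URL limited to datasets under one collection."""
    params = {
        "q": query,
        "subtree": alias,   # collection alias rather than a database ID
        "type": "dataset",
        "per_page": per_page,
    }
    return f"{base_url}/api/search?{urlencode(params)}"


# Alias assumed from https://dataverse.harvard.edu/dataverse/originalMRA
print(search_url("https://dataverse.harvard.edu", "originalMRA"))
```

The `*` in the query is percent-encoded as `%2A` in the printed URL, which is expected.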
