
Re-harvest from Borealis Repository #172

Open
jggautier opened this issue Jun 27, 2022 · 29 comments
Labels
bug (Something isn't working), Feature: Harvesting, GREI 3 Search and Browse, NIH GREI (General work related to any of the NIH GREI aims), Size: 10 (A percentage of a sprint)

Comments

@jggautier
Collaborator

jggautier commented Jun 27, 2022

The harvesting client that harvested metadata from Scholars Portal isn't listed in the Manage Harvesting Clients page and isn't in the clients list returned by the API endpoint for listing clients (https://dataverse.harvard.edu/api/harvest/clients).

So I'm not able to manage that client, such as to re-harvest from Borealis so that the links behind the dataset titles lead users to the datasets instead of an error page.

The client was listed until the following steps caused this bug:

  • From the Manage Harvesting Clients page, I clicked on the trash icon to delete the client, which was named "scholars-portal". In the "Last Results" column, I saw that the deletion was "in progress".
  • I visited the Dataverse Collection page, which had harvested 7k+ datasets, and refreshed the page to make sure that the number of harvested datasets was decreasing.
  • But once all records seemed to be deleted (on the Dataverse Collection page, the number of harvested datasets was 0), the "Last Results" column on the Manage Harvesting Clients page still reported that the deletion was in progress.
  • I renamed the Dataverse collection alias (to "borealis_harvested" due to the Scholars Portal name change)
  • Unexpectedly, harvested datasets started to be added to the collection. Each time I refreshed the Dataverse collection page, more datasets were added until 7k+ datasets returned. It seems like all 7k+ datasets that were there before I tried to delete them returned.
  • The Manage Harvesting Clients page no longer lists any client for this harvesting job, so I can't try to delete and re-add the client.

The OAI set that needs to be harvested by the Harvard Dataverse Repository contains 8,237 records as of this writing, but https://dataverse.harvard.edu/dataverse/borealis_harvested includes only 7,401 records, which I think was the same number of records that the Dataverse collection had before I tried to delete the harvesting client.
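For counts like these, one way to verify the OAI set size independently is to page through the server's ListIdentifiers responses and tally the record headers. A minimal sketch of the per-page parsing, assuming a standard OAI-PMH 2.0 response (the fetch loop that follows resumptionToken values from page to page is omitted):

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def count_identifiers_page(xml_text):
    """Return (number of record headers, resumptionToken or None)
    for one OAI-PMH ListIdentifiers response page."""
    root = ET.fromstring(xml_text)
    headers = root.findall(f".//{OAI_NS}header")
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    # An empty resumptionToken element marks the final page
    token = token_el.text if token_el is not None and token_el.text else None
    return len(headers), token
```

Summing the per-page counts until the token comes back empty gives the total number of records in the set, which can then be compared against what the collection page shows.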

  • Can a developer look into what happened?
  • Since I can't do it through the UI, I thought maybe I could do this with an API but I can't find any API endpoints in the API guides for deleting a harvesting client. At https://demo.dataverse.org/openapi I see some references to "harvestingClient" but I'm not sure if there are any undocumented endpoints for deleting or modifying a client. Does one exist?
  • Should I delete the records using the API endpoint for deleting datasets and create a new harvesting client to harvest the Borealis repository's records?
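For what it's worth, later Dataverse releases document a Harvesting Clients API with a DELETE endpoint at /api/harvestingClients/{nickname}; whether that endpoint exists on this production version is an assumption. A sketch of building such a request with only the standard library (the request is constructed but not sent):

```python
import urllib.request

def delete_client_request(base_url, nickname, api_token):
    """Build (but do not send) a DELETE request for a harvesting client.

    The endpoint path follows the Harvesting Clients API documented in
    later Dataverse releases; its availability here is an assumption."""
    return urllib.request.Request(
        url=f"{base_url}/api/harvestingClients/{nickname}",
        method="DELETE",
        headers={"X-Dataverse-key": api_token},
    )

# To execute, a superuser API token would be needed:
# urllib.request.urlopen(delete_client_request(base, "scholars-portal", token))
```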
@jggautier jggautier added the bug Something isn't working label Jun 27, 2022
@sbarbosadataverse

What's the likelihood this issue will be fixed with the Harvesting updates in progress? @mreekie @siacus
We don't want to add this to the Dataverse Backlog for Harvard Dataverse if they may get fixed by the harvesting updates.

Thanks

@jggautier jggautier changed the title Unable to remove harvested datasets from Borealis Harvested Dataverse Harvesting client missing on Manage Harvesting Clients page; unable to update client to fix broken links to datasets in Borealis Feb 14, 2023
@cmbz cmbz added NIH GREI General work related to any of the NIH GREI aims Feature: Harvesting labels Dec 18, 2023
@cmbz cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project Dec 18, 2023
@cmbz
Collaborator

cmbz commented Dec 19, 2023

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

@cmbz
Collaborator

cmbz commented Dec 19, 2023

2023/12/19: @jggautier and @landreev will follow up after meeting. Sizing at a 10 initially.

@cmbz cmbz added the Size: 10 A percentage of a sprint. label Dec 19, 2023
@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Dec 19, 2023
@landreev
Collaborator

After a brief review, we may have an idea of what caused the weird behavior described above, during the attempt to delete the client, and with the collection view after that.
But regardless of the exact details, the Scholars Portal client, and all the harvested datasets associated with it, were indeed deleted in the end.
So the next step should be to create a brand new client to harvest from their current OAI and to give it a try.

@jggautier
Collaborator Author

jggautier commented Dec 19, 2023

I tried harvesting from Borealis into the collection at https://dataverse.harvard.edu/dataverse/borealis_harvested, but no records are showing up.

I was able to create a client at https://dataverse.harvard.edu/harvestclients.xhtml?dataverseId=1, so there's a row on the page's clients table, and when I press that row's button to start the harvest, the page shows the blue banner message telling me that the harvest started.

But the Last Run and Last Results columns for that row are empty, when I expected to see "IN PROGRESS", even after I refresh my browser and check the clients table on other computers.

Maybe the harvesting job is failing for some reason. Could you tell what's happening @landreev?

@landreev
Collaborator

OK, I'll take a look. But if it's not something I can figure out right away, it'll have to wait till next year.

@landreev
Collaborator

I was able to start it from the Harvesting Clients page (showing "IN PROGRESS" now).
Expired session, or something like that maybe?
Seeing some results in the collection now.
Let's see how it goes, how many failures we get, etc. (If there are too many, we could maybe try ddi instead?)

@landreev
Collaborator

OK to change the title of the issue to "Re-Harvest from Borealis Repository", or something along these lines?

@jggautier jggautier changed the title Harvesting client missing on Manage Harvesting Clients page; unable to update client to fix broken links to datasets in Borealis Re-harvest from Borealis Repository Dec 19, 2023
@jggautier
Collaborator Author

Changing the title makes sense to me. Just changed it.

@landreev
Collaborator

landreev commented Dec 19, 2023

That's a nasty success-to-failure ratio. ☹️
Maybe some controlled vocabulary or metadata block mismatch between our sites that makes harvesting in native JSON impossible? I'll try to take a closer look before the new year.

Deleting this client (and all the harvested content with it), then re-creating it with the DDI as the format, just to see how that goes could be an interesting experiment too.

@landreev
Collaborator

About 700 of these errors in the harvest log:
incorrect multiple for field productionPlace
I'm assuming this means that they are running the version of the citation block where this is still a single value-only field. (Makes sense, since they are on 5.13).
Must be more problems like this with other fields and/or blocks.
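To see whether other fields have the same multiplicity problem, one could tally these messages across the harvest log. A small sketch, assuming the error text quoted above appears verbatim in the log lines:

```python
import re
from collections import Counter

# Matches the error text quoted above; assumes it appears verbatim in the log
ERROR_RE = re.compile(r"incorrect multiple for field (\w+)")

def tally_multiplicity_errors(log_lines):
    """Count 'incorrect multiple for field <name>' errors per field name."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Running this over the harvest log would show at a glance which citation-block fields disagree between the two versions, not just productionPlace.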

@jggautier
Collaborator Author

Bikramjit Singh, a system admin from Scholars Portal, wrote in the related email thread that Borealis plans to upgrade its Dataverse software early next year.

@landreev
Collaborator

Good to know. It's still important to keep in mind that harvesting in native json is always subject to problems like this. For example, the problem I mentioned above - the specific field that we recognize as a multiple but the source instance does not - would not be an issue if we were harvesting DDI.

@bikramj

bikramj commented Dec 20, 2023

Thank you. Tagging Borealis developers @JayanthyChengan and @lubitchv, if they can help with this.

@jggautier
Collaborator Author

Harvard Dataverse was able to harvest 5,866 records into https://dataverse.harvard.edu/dataverse/borealis_harvested, and the table on the "Manage Harvesting Clients" page shows that 13,856 records failed to be harvested.

I'm tempted to try to delete this client and try harvesting using its DDI-C metadata instead. @landreev should I try that?

@amberleahey

> Thank you. Tagging Borealis developers @JayanthyChengan and @lubitchv, if they can help with this.

Hi all -- we have several sets for OAI; the main ones should work: https://borealisdata.ca/oai?verb=ListRecords&metadataPrefix=oai_dc, https://borealisdata.ca/oai?verb=ListIdentifiers&metadataPrefix=oai_dc, or https://borealisdata.ca/oai?verb=ListSets
Is this what is needed for the Harvard Dataverse harvesters?

@landreev
Collaborator

@jggautier It looks like I missed your question here back in January - but if you want to experiment with deleting the existing client + records and re-harvesting from scratch, yes, please go ahead. Please note that it'll take some time, possibly some hours, to delete 6K records.
Also, please run large harvests like this on dvn-cloud-app-2 specifically.

@jggautier
Collaborator Author

jggautier commented Feb 28, 2024

Thanks @amberleahey. Yes, I'm going to try harvesting Borealis metadata using the DDI metadata format instead of the Dataverse native json format that I usually use.

I just told Harvard Dataverse to delete the client that was trying each week to harvest metadata from Borealis, and I'll check tomorrow since it'll take a while like @landreev wrote.

Then, if the client and all harvested metadata have been deleted, I'll make sure that I'm on dvn-cloud-app-2, create a new client without specifying a set so that metadata from all datasets published in Borealis is harvested into the collection at https://dataverse.harvard.edu/dataverse/borealis_harvested, tell Harvard Dataverse to harvest all metadata from Borealis, and see if it gets the metadata of all 20k datasets.

@jggautier
Collaborator Author

jggautier commented Feb 29, 2024

The client isn't listed on the table on the Manage Harvesting Clients page anymore, and all records were deleted when I checked this morning.

I made sure I was on dvn-cloud-app-2.lib.harvard.edu, created a new client without specifying a set, using the oai_ddi format and the "Dataverse v4+" "Archive Type", and told Harvard Dataverse to start harvesting.

Records are being added to https://dataverse.harvard.edu/dataverse/borealis_harvested. So far so good!

I'll check tomorrow to see how many of the 20,093 datasets were harvested.

@jggautier
Collaborator Author

jggautier commented Mar 1, 2024

Hmmm, the Manage Harvesting Clients page says that 19,207 records were harvested and 453 records failed:
[Screenshot from 2024-03-01: Manage Harvesting Clients table showing 19,207 records harvested and 453 failed]

But there are only 4,777 records in https://dataverse.harvard.edu/dataverse/borealis_harvested. I'm not sure where the clients page gets the 19,207 number from. @landreev, any ideas?

I haven't tried to see why most records failed to be harvested, which I might do by comparing the oai_ddi metadata of records that were harvested to the oai_ddi metadata of records that weren't.

Maybe there's another way to get more info about the failures?

@landreev
Collaborator

I wasn't able to tell what was up with the mismatched counts above right away. Going to take another, closer look.

@landreev
Collaborator

As always, the simplest explanation ends up being the correct one. The extra 15K records were successfully harvested - they are in the database - but they didn't get indexed in Solr.
The reason they didn't get indexed was that Solr apparently was overwhelmed and started dropping requests. I am not sure yet whether it was overwhelmed by this very harvest - by having to index so many records in a row in such a short period of time - or if it was having trouble for unrelated reasons (like bots crawling collection pages). I have some suspicions that it may indeed be the former. One way or another, we do need an extra mechanism for monitoring the database for unindexed datasets (harvested or local) and reindexing them after the fact. That will be handled outside of this issue.

I am reindexing these extra Borealis records now, so the numbers showing on the collection page should be growing. (But I'm going through them slowly, sleeping between datasets, to be gentle on Solr.)
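The "slowly, sleeping between datasets" approach could be sketched as follows. The reindex call itself is abstracted into a callable, since the exact admin endpoint (something like /api/admin/index/datasets/{id} run against localhost) is an assumption here:

```python
import time

def reindex_gently(dataset_ids, reindex_one, pause_seconds=2.0, sleep=time.sleep):
    """Reindex datasets one at a time, pausing between calls so Solr is
    not flooded with back-to-back indexing requests.

    `reindex_one` should wrap whatever admin reindex call the
    installation provides (the endpoint name above is an assumption)."""
    done = 0
    for dataset_id in dataset_ids:
        reindex_one(dataset_id)
        done += 1
        sleep(pause_seconds)
    return done
```

Passing `sleep` and `reindex_one` in as parameters also makes the pacing logic trivially testable without touching a real Solr instance.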

@landreev
Collaborator

I stopped it after 4K datasets. But we will get it reindexed eventually.

@landreev landreev moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Mar 14, 2024
@landreev
Collaborator

landreev commented Apr 8, 2024

The status of this effort: the actual harvesting of their records is working OK now, with a fairly decent success-to-failure ratio using the DDI format. The problem area is indexing - the harvested records do not all get indexed in real time, and therefore do not all show up in the collection. This issue is outside of the harvesting framework and has to do with the general Solr/indexing issues we are having in production (new investigation issue IQSS/dataverse#10469), but the effect is especially noticeable in this scenario - an initial harvest of a very large set (thousands of datasets, 20K+ in this case) - since it is simply no longer possible to quickly index that many datasets in a row.

I got a few thousand more of the Borealis datasets indexed, but, again, had to stop since it was clearly causing Solr issues.

@landreev landreev added the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Apr 10, 2024
@jggautier
Collaborator Author

jggautier commented Apr 16, 2024

Thanks! I think this might be the case with at least one other Dataverse installation that Harvard Dataverse harvests from.

There are 270 unique records in the "all_items_dataverse_mel" set of the client named icarda (https://data.mel.cgiar.org/oai). And the copy of the database I use says that there are 270 datasets in the collection that we harvest those records into, although that copy hasn't been updated in about a month.

But only 224 records appear in that collection in the UI and from Search API results.

@cmbz cmbz added the GREI 3 Search and Browse label Apr 19, 2024
@cmbz cmbz moved this from This Sprint 🏃‍♀️ 🏃 to Waiting ⌛ in IQSS Dataverse Project May 8, 2024
@cmbz cmbz removed the Status: Needs Input Applied to issues in need of input from someone currently unavailable label May 8, 2024
@cmbz
Collaborator

cmbz commented Jul 10, 2024

2024/07/10

@jggautier
Collaborator Author

jggautier commented Jul 11, 2024

We haven't told Harvard Dataverse to update the records it has from the Borealis repository. I think @landreev and others are continuing work on indexing improvements that will help Harvard Dataverse harvest more records - or, more specifically, ensure that those records appear on search pages - so harvesting, and updating the records that Harvard Dataverse has harvested from any repository, has been put on hold.

I think the same is true for all GitHub issues listed in IQSS/dataverse-pm#171 that are about ensuring that Harvard Dataverse is able to harvest all records from the repositories it harvests from and would like to start harvesting from.

@landreev I hope you don't mind that I defer to you about the status of that indexing work.

@cmbz
Collaborator

cmbz commented Jul 17, 2024

Assigning you both @landreev @jggautier to monitor this issue. Thanks.

@jggautier
Collaborator Author

Like other GitHub issues about Harvard Dataverse harvesting from other repositories, this issue is on hold pending work being done to improve how Dataverse harvests.
