Re-harvest from Borealis Repository #172
Comments
2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing. |
2023/12/19: @jggautier and @landreev will follow up after the meeting. Sizing at a 10 initially. |
After a brief review, we may have an idea of what caused the weird behavior described above, during the attempt to delete the client, and with the collection view after that. |
I tried harvesting from Borealis into the collection at https://dataverse.harvard.edu/dataverse/borealis_harvested, but no records are showing up. I was able to create a client at https://dataverse.harvard.edu/harvestclients.xhtml?dataverseId=1, so there's a row on the page's clients table, and when I press that row's button to start the harvest, the page shows the blue banner message telling me that the harvest started. But the Last Run and Last Results columns for that row are empty, when I expected to see "IN PROGRESS", even after I refresh my browser and check the clients table on other computers. Maybe the harvesting job is failing for some reason. Could you tell what's happening @landreev? |
OK, I'll take a look. But if it's not something I can figure out right away, it'll have to wait till next year. |
I was able to start it from the Harvesting Clients page (showing "IN PROGRESS" now). |
OK to change the title of the issue to "Re-Harvest from Borealis Repository", or something along these lines? |
Changing the title makes sense to me. Just changed it. |
That's a nasty success-to-failure ratio. Deleting this client (and all the harvested content with it), then re-creating it with DDI as the format, just to see how that goes, could be an interesting experiment too. |
About 700 of these errors in the harvest log: |
Bikramjit Singh, a system admin from Scholar's portal, wrote in the related email thread that Borealis plans to upgrade its Dataverse software early next year. |
Good to know. It's still important to keep in mind that harvesting in native json is always subject to problems like this. For example, the problem I mentioned above - the specific field that we recognize as a multiple but the source instance does not - would not be an issue if we were harvesting DDI. |
Thank you. Tagging Borealis developers @JayanthyChengan and @lubitchv, if they can help with this. |
Harvard Dataverse was able to harvest 5,866 records into https://dataverse.harvard.edu/dataverse/borealis_harvested, and the table on the "Manage Harvesting Clients" page shows that 13,856 records failed to be harvested. I'm tempted to delete this client and try harvesting using its DDI-C metadata instead. @landreev should I try that? |
Hi all -- we have several sets for OAI , the main ones should work https://borealisdata.ca/oai?verb=ListRecords&metadataPrefix=oai_dc or https://borealisdata.ca/oai?verb=ListIdentifiers&metadataPrefix=oai_dc |
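For anyone who wants to check how many records one of these OAI endpoints exposes, here is a minimal sketch that counts non-deleted identifiers across `ListIdentifiers` pages by following resumption tokens. The function names and the `fetch_borealis` helper are made up for illustration; only the endpoint URL and the `oai_dc` prefix come from the comment above, and this is not how Dataverse's own harvester is implemented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Standard OAI-PMH 2.0 XML namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) for one ListIdentifiers page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:header", OAI_NS)
        if header.get("status") != "deleted"  # skip deleted records
    ]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    # An absent or empty token means this was the last page.
    return identifiers, (token or "").strip() or None


def count_identifiers(fetch_page, metadata_prefix="oai_dc"):
    """Count identifiers over all pages; fetch_page(query_dict) returns XML."""
    query = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    total = 0
    while True:
        identifiers, token = parse_list_identifiers(fetch_page(query))
        total += len(identifiers)
        if token is None:
            return total
        # resumptionToken is an exclusive argument in OAI-PMH.
        query = {"verb": "ListIdentifiers", "resumptionToken": token}


def fetch_borealis(query):
    """Hypothetical fetcher for the endpoint mentioned in this thread."""
    with urlopen("https://borealisdata.ca/oai?" + urlencode(query)) as resp:
        return resp.read()
```

Calling `count_identifiers(fetch_borealis)` would walk the whole list over the network; passing a different `metadata_prefix` (for example `oai_ddi`) should work the same way, assuming the server offers that format.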
@jggautier It looks like I missed your question here back in January - but yes, if you want to experiment with deleting the existing client + records and re-harvesting from scratch, yes please go ahead. Please note that it'll take some time, possibly some hours, to delete 6K records. |
Thanks @amberleahey. Yes, I'm going to try harvesting Borealis metadata using the DDI metadata format instead of the Dataverse native json format that I usually use. I just told Harvard Dataverse to delete the client that was trying each week to harvest metadata from Borealis, and I'll check tomorrow since it'll take a while like @landreev wrote. Then if the client and all harvested metadata has been deleted, I'll make sure that I'm on dvn-cloud-app-2, create a new client without specifying a set so that metadata from all datasets published in Borealis are harvested into the collection at https://dataverse.harvard.edu/dataverse/borealis_harvested, tell Harvard Dataverse to harvest all metadata from Borealis, and see if it gets the metadata of all 20k datasets |
The client isn't listed in the table on the Manage Harvesting Clients page anymore, and all records had been deleted when I checked this morning. I made sure I was on dvn-cloud-app-2.lib.harvard.edu, created a new client without specifying a set, using the oai_ddi format and the "Dataverse v4+" "Archive Type", and told Harvard Dataverse to start harvesting. Records are being added to https://dataverse.harvard.edu/dataverse/borealis_harvested. So far so good! I'll check tomorrow to see how many of the 20,093 datasets were harvested. |
Hmmm, the Manage Harvesting Client page says that 19207 records were harvested and 453 records failed: But there are 4,777 in https://dataverse.harvard.edu/dataverse/borealis_harvested. I'm not sure where the client page gets the 19207 number from. @landreev, any ideas? I haven't tried to see why most records failed to be harvested, which I might do by comparing the oai_ddi metadata of records that were harvested to the oai_ddi metadata of records that weren't. Maybe there's another way to get more info about the failures? |
I wasn't able to tell what was up with the mismatched counts above right away. Going to take another, closer look. |
As always, the simplest explanation ends up being the correct one. The extra 15K records were successfully harvested, they are in the database, but they didn't get indexed in solr. I am reindexing these extra Borealis records now, so the numbers showing on the collection page should be growing. (but I'm going through them slowly/sleeping between datasets, to be gentle on solr) |
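The "slowly, sleeping between datasets" approach can be sketched roughly as below. This is only an illustration of the throttling idea, not the script that was actually run: `reindex_one` is a hypothetical callable standing in for whatever reindexing call the installation uses, and the pause length and `limit` parameter are assumptions.

```python
import time


def reindex_slowly(dataset_ids, reindex_one, pause_seconds=2.0, limit=None,
                   sleep=time.sleep):
    """Reindex datasets one at a time, pausing after each to go easy on solr.

    reindex_one: hypothetical callable that reindexes a single dataset id.
    limit: optional cap so a run can be stopped after N datasets.
    Returns the ids that were processed, so a later run can resume.
    """
    done = []
    for position, dataset_id in enumerate(dataset_ids):
        if limit is not None and position >= limit:
            break
        reindex_one(dataset_id)
        done.append(dataset_id)
        sleep(pause_seconds)  # give solr time between datasets
    return done
```

Returning the processed ids makes it easy to stop after a batch and pick up the remainder later.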
I stopped it after 4K datasets. But we will get it reindexed eventually. |
The status of this effort: the actual harvesting of their records is working OK now, with a fairly decent success-to-failure ratio using the DDI format. The problem area is indexing: the harvested records do not all get indexed in real time, and therefore do not show up in the collection. This issue is outside of the harvesting framework and has to do with the general solr/indexing issues we are having in production (new investigation issue IQSS/dataverse#10469), but its effect is especially noticeable in this scenario, an initial harvest of a very large collection (thousands of datasets, 20K+ in this case): it is simply no longer possible to quickly index that many datasets in a row. I got a few thousand more of the Borealis datasets indexed but, again, had to stop since it was clearly causing solr issues. |
Thanks! I think this might be the case with at least one other Dataverse installation that Harvard Dataverse harvests from. There are 270 unique records in the "all_items_dataverse_mel" set of the client named icarda (https://data.mel.cgiar.org/oai). And the copy of the database I use says that there are 270 datasets in the collection that we harvest those records into, although that copy hasn't been updated in about a month. But only 224 records appear in that collection in the UI and from Search API results. |
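A rough way to spot this kind of gap is to compare the number of datasets in the database with the `total_count` that the Search API reports for the collection. The sketch below assumes the Search API's `subtree`, `type`, and `per_page` parameters and a `data.total_count` field in the response; the function names are made up for illustration, and the collection alias is just an example.

```python
import json
from urllib.parse import urlencode


def search_url(base_url, collection_alias):
    """Build a Search API URL that asks only for the dataset count."""
    params = {"q": "*", "type": "dataset", "subtree": collection_alias,
              "per_page": 1}
    return f"{base_url}/api/search?{urlencode(params)}"


def indexed_count(response_text):
    """Read the number of indexed datasets from a Search API response."""
    return json.loads(response_text)["data"]["total_count"]


def missing_from_index(database_count, response_text):
    """How many datasets are in the database but not in search results."""
    return database_count - indexed_count(response_text)
```

With the numbers above (270 datasets in the database copy, 224 in search results), `missing_from_index` would report 46 datasets that are not indexed.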
2024/07/10
|
We haven't told Harvard Dataverse to update the records it has from Borealis Repository. I think @landreev and others continue work on indexing improvements that will help Harvard Dataverse harvest more records, or more specifically ensure that those records appear on search pages, so harvesting, or updating the records that Harvard Dataverse has harvested from any repositories, has been put on hold. I think the same is true for all GitHub issues listed in IQSS/dataverse-pm#171 that are about ensuring that Harvard Dataverse is able to harvest all records from the repositories it harvests from and would like to start harvesting from. @landreev I hope you don't mind that I defer to you about the status of that indexing work. |
Assigning you both @landreev @jggautier to monitor this issue. Thanks. |
Like other GitHub issues about Harvard Dataverse harvesting from other repositories, this issue is on hold pending work being done to improve how Dataverse harvests. |
The harvesting client that harvested metadata from Scholar's Portal isn't listed in the Manage Harvesting Clients page and isn't in the clients list returned by the API endpoint for listing clients (https://dataverse.harvard.edu/api/harvest/clients).
So I'm not able to manage that client, such as to re-harvest from Borealis so that the links behind the dataset titles lead users to the datasets instead of an error page.
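To confirm that a client is missing from the API's list, something like the following sketch could be run against the JSON returned by the endpoint above. The response shape (`data.harvestingClients` entries with a `nickName` field) is an assumption about the API's output, and the nickname `borealis` in the usage note is hypothetical.

```python
import json


def client_nicknames(response_text):
    """Extract client nicknames from a /api/harvest/clients response.

    Assumes the payload looks like
    {"data": {"harvestingClients": [{"nickName": ...}, ...]}}.
    """
    payload = json.loads(response_text)
    clients = payload.get("data", {}).get("harvestingClients", [])
    return [client.get("nickName") for client in clients]


def is_listed(response_text, nickname):
    """True if a client with this nickname appears in the response."""
    return nickname in client_nicknames(response_text)
```

For example, `is_listed(body, "borealis")` returning `False` would match what the Manage Harvesting Clients page shows.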
The client was listed until the following steps caused this bug:
The OAI set that needs to be harvested by the Harvard Dataverse Repository contains 8,237 records as of this writing, but https://dataverse.harvard.edu/dataverse/borealis_harvested includes only 7,401 records, which I think was the same number of records that the Dataverse collection had before I tried to delete the harvesting client.