
Re-harvest from Borealis Repository #172

Open
jggautier opened this issue Jun 27, 2022 · 29 comments
Labels
bug (Something isn't working), Feature: Harvesting, GREI 3 Search and Browse, NIH GREI (General work related to any of the NIH GREI aims), Size: 10 (A percentage of a sprint)

Comments

@jggautier
Collaborator

jggautier commented Jun 27, 2022

The harvesting client that harvested metadata from Scholars Portal isn't listed in the Manage Harvesting Clients page and isn't in the clients list returned by the API endpoint for listing clients (https://dataverse.harvard.edu/api/harvest/clients).

So I'm not able to manage that client, such as to re-harvest from Borealis so that the links behind the dataset titles lead users to the datasets instead of an error page.

The client was listed until the following steps caused this bug:

  • From the Manage Harvesting Clients page, I clicked on the trash icon to delete the client, which was named "scholars-portal". In the "Last Results" column, I saw that the deletion was "in progress".
  • I visited the Dataverse Collection page, which had harvested 7k+ datasets, and refreshed the page to make sure that the number of harvested datasets was decreasing.
  • But once all records seemed to be deleted (on the Dataverse Collection page, the number of harvested datasets was 0), the "Last Results" column on the Manage Harvesting Clients page still reported that the deletion was in progress.
  • I renamed the Dataverse collection alias (to "borealis_harvested" due to the Scholars Portal name change)
  • Unexpectedly, harvested datasets started to be added to the collection. Each time I refreshed the Dataverse collection page, more datasets were added until 7k+ datasets returned. It seems like all 7k+ datasets that were there before I tried to delete them returned.
  • The Manage Harvesting Clients page no longer lists any client for this harvesting job, so I can't try to delete and re-add the client.

The OAI set that needs to be harvested by the Harvard Dataverse Repository contains 8,237 records as of this writing, but https://dataverse.harvard.edu/dataverse/borealis_harvested includes only 7,401 records, which I think was the same number of records that the Dataverse collection had before I tried to delete the harvesting client.
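For counts like these, one way to verify the OAI set size independently is to page through the server's ListIdentifiers responses and tally the record headers. A minimal sketch of the per-page parsing, assuming a standard OAI-PMH 2.0 response (the fetch loop that follows resumptionToken values from page to page is omitted):

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def count_identifiers_page(xml_text):
    """Return (number of record headers, resumptionToken or None)
    for one OAI-PMH ListIdentifiers response page."""
    root = ET.fromstring(xml_text)
    headers = root.findall(f".//{OAI_NS}header")
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    # An empty resumptionToken element marks the final page
    token = token_el.text if token_el is not None and token_el.text else None
    return len(headers), token
```

Summing the per-page counts until the token comes back empty gives the total number of records in the set, which can then be compared against what the collection page shows.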

  • Can a developer look into what happened?
  • Since I can't do it through the UI, I thought maybe I could do this with an API but I can't find any API endpoints in the API guides for deleting a harvesting client. At https://demo.dataverse.org/openapi I see some references to "harvestingClient" but I'm not sure if there are any undocumented endpoints for deleting or modifying a client. Does one exist?
  • Should I delete the records using the API endpoint for deleting datasets and create a new harvesting client to harvest the Borealis repository's records?
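For what it's worth, later Dataverse releases document a Harvesting Clients API with a DELETE endpoint at /api/harvestingClients/{nickname}; whether that endpoint exists on this production version is an assumption. A sketch of building such a request with only the standard library (the request is constructed but not sent):

```python
import urllib.request

def delete_client_request(base_url, nickname, api_token):
    """Build (but do not send) a DELETE request for a harvesting client.

    The endpoint path follows the Harvesting Clients API documented in
    later Dataverse releases; its availability here is an assumption."""
    return urllib.request.Request(
        url=f"{base_url}/api/harvestingClients/{nickname}",
        method="DELETE",
        headers={"X-Dataverse-key": api_token},
    )

# To execute, a superuser API token would be needed:
# urllib.request.urlopen(delete_client_request(base, "scholars-portal", token))
```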
@jggautier jggautier added the bug Something isn't working label Jun 27, 2022
@sbarbosadataverse

What's the likelihood this issue will be fixed with the Harvesting updates in progress? @mreekie @siacus
We don't want to add this to the Dataverse Backlog for Harvard Dataverse if they may get fixed by the harvesting updates.

Thanks

@jggautier jggautier changed the title Unable to remove harvested datasets from Borealis Harvested Dataverse Harvesting client missing on Manage Harvesting Clients page; unable to update client to fix broken links to datasets in Borealis Feb 14, 2023
@cmbz cmbz added NIH GREI General work related to any of the NIH GREI aims Feature: Harvesting labels Dec 18, 2023
@cmbz cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project Dec 18, 2023
@cmbz
Collaborator

cmbz commented Dec 19, 2023

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

@cmbz
Collaborator

cmbz commented Dec 19, 2023

2023/12/19: @jggautier and @landreev will follow up after meeting. Sizing at a 10 initially.

@cmbz cmbz added the Size: 10 A percentage of a sprint. label Dec 19, 2023
@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Dec 19, 2023
@landreev
Collaborator

After a brief review, we may have an idea of what caused the weird behavior described above, during the attempt to delete the client, and with the collection view after that.
But regardless of the exact details, the Scholars Portal client, and all the harvested datasets associated with it, were indeed deleted in the end.
So the next step should be to create a brand new client to harvest from their current OAI and to give it a try.

@jggautier
Collaborator Author

jggautier commented Dec 19, 2023

I tried harvesting from Borealis into the collection at https://dataverse.harvard.edu/dataverse/borealis_harvested, but no records are showing up.

I was able to create a client at https://dataverse.harvard.edu/harvestclients.xhtml?dataverseId=1, so there's a row on the page's clients table, and when I press that row's button to start the harvest, the page shows the blue banner message telling me that the harvest started.

But the Last Run and Last Results columns for that row are empty, when I expected to see "IN PROGRESS", even after I refresh my browser and check the clients table on other computers.

Maybe the harvesting job is failing for some reason. Could you tell what's happening @landreev?

@landreev
Collaborator

OK, I'll take a look. But if it's not something I can figure out right away, it'll have to wait till next year.

@landreev
Collaborator

I was able to start it from the Harvesting Clients page (showing "IN PROGRESS" now).
Expired session, or something like that maybe?
Seeing some results in the collection now.
Let's see how it goes, how many failures we get, etc. (If there are too many, we could maybe try ddi instead?)

@landreev
Collaborator

OK to change the title of the issue to "Re-Harvest from Borealis Repository", or something along these lines?

@jggautier jggautier changed the title Harvesting client missing on Manage Harvesting Clients page; unable to update client to fix broken links to datasets in Borealis Re-harvest from Borealis Repository Dec 19, 2023
@jggautier
Collaborator Author

Changing the title makes sense to me. Just changed it.

@landreev
Collaborator

landreev commented Dec 19, 2023

That's a nasty success-to-failure ratio. ☹️
Maybe some controlled vocabulary or metadata block mismatch between our sites that makes harvesting in native JSON impossible? I'll try to take a closer look before the new year.

Deleting this client (and all the harvested content with it), then re-creating it with the DDI as the format, just to see how that goes could be an interesting experiment too.

@landreev
Collaborator

About 700 of these errors in the harvest log:
incorrect multiple for field productionPlace
I'm assuming this means that they are running the version of the citation block where this is still a single value-only field. (Makes sense, since they are on 5.13).
Must be more problems like this with other fields and/or blocks.
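To see whether other fields have the same multiplicity problem, one could tally these messages across the harvest log. A small sketch, assuming the error text quoted above appears verbatim in the log lines:

```python
import re
from collections import Counter

# Matches the error text quoted above; assumes it appears verbatim in the log
ERROR_RE = re.compile(r"incorrect multiple for field (\w+)")

def tally_multiplicity_errors(log_lines):
    """Count 'incorrect multiple for field <name>' errors per field name."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Running this over the harvest log would show at a glance which citation-block fields disagree between the two versions, not just productionPlace.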

@jggautier
Collaborator Author

Bikramjit Singh, a system admin from Scholars Portal, wrote in the related email thread that Borealis plans to upgrade its Dataverse software early next year.

@landreev
Collaborator

Good to know. It's still important to keep in mind that harvesting in native json is always subject to problems like this. For example, the problem I mentioned above - the specific field that we recognize as a multiple but the source instance does not - would not be an issue if we were harvesting DDI.

@bikramj

bikramj commented Dec 20, 2023

Thank you. Tagging Borealis developers @JayanthyChengan and @lubitchv, if they can help with this.

@jggautier
Collaborator Author

Harvard Dataverse was able to harvest 5,866 records into https://dataverse.harvard.edu/dataverse/borealis_harvested, and the table on the "Manage Harvesting Clients" page shows that 13,856 records failed to be harvested.

I'm tempted to try to delete this client and try harvesting using its DDI-C metadata instead. @landreev should I try that?

@amberleahey

> Thank you. Tagging Borealis developers @JayanthyChengan and @lubitchv, if they can help with this.

Hi all -- we have several sets for OAI; the main ones should work: https://borealisdata.ca/oai?verb=ListRecords&metadataPrefix=oai_dc, https://borealisdata.ca/oai?verb=ListIdentifiers&metadataPrefix=oai_dc, or https://borealisdata.ca/oai?verb=ListSets
Is this what is needed for the Harvard Dataverse harvesters?

@landreev
Collaborator

@jggautier It looks like I missed your question here back in January - but if you want to experiment with deleting the existing client + records and re-harvesting from scratch, yes, please go ahead. Please note that it'll take some time, possibly some hours, to delete 6K records.
Also, please run large harvests like this on dvn-cloud-app-2 specifically.

@jggautier
Collaborator Author

jggautier commented Feb 28, 2024

Thanks @amberleahey. Yes, I'm going to try harvesting Borealis metadata using the DDI metadata format instead of the Dataverse native json format that I usually use.

I just told Harvard Dataverse to delete the client that was trying each week to harvest metadata from Borealis, and I'll check tomorrow since it'll take a while like @landreev wrote.

Then, if the client and all harvested metadata have been deleted, I'll make sure that I'm on dvn-cloud-app-2, create a new client without specifying a set so that metadata from all datasets published in Borealis is harvested into the collection at https://dataverse.harvard.edu/dataverse/borealis_harvested, tell Harvard Dataverse to harvest all metadata from Borealis, and see if it gets the metadata of all 20k datasets.

@jggautier
Collaborator Author

jggautier commented Feb 29, 2024

The client isn't listed on the table on the Manage Harvesting Clients page anymore, and all records were deleted when I checked this morning.

I made sure I was on dvn-cloud-app-2.lib.harvard.edu, created a new client without specifying a set, using the oai_ddi format and the "Dataverse v4+" "Archive Type", and told Harvard Dataverse to start harvesting.

Records are being added to https://dataverse.harvard.edu/dataverse/borealis_harvested. So far so good!

I'll check tomorrow to see how many of the 20,093 datasets were harvested.

@jggautier
Collaborator Author

jggautier commented Mar 1, 2024

Hmmm, the Manage Harvesting Clients page says that 19,207 records were harvested and 453 records failed:
[Screenshot from 2024-03-01: Manage Harvesting Clients table showing 19,207 records harvested and 453 failed]

But there are only 4,777 records in https://dataverse.harvard.edu/dataverse/borealis_harvested. I'm not sure where the clients page gets the 19,207 number from. @landreev, any ideas?

I haven't tried to see why most records failed to be harvested, which I might do by comparing the oai_ddi metadata of records that were harvested to the oai_ddi metadata of records that weren't.

Maybe there's another way to get more info about the failures?

@landreev
Collaborator

I wasn't able to tell what was up with the mismatched counts above right away. Going to take another, closer look.

@landreev
Collaborator

As always, the simplest explanation ends up being the correct one. The extra 15K records were successfully harvested - they are in the database - but they didn't get indexed in Solr.
The reason they didn't get indexed was that Solr apparently was overwhelmed and started dropping requests. I am not sure yet whether it was overwhelmed by this very harvest - by having to index so many records in a row in such a short period of time - or if it was having trouble for unrelated reasons (like bots crawling collection pages). I have some suspicions that it may indeed be the former. One way or another, we do need an extra mechanism for monitoring the database for unindexed datasets (harvested or local) and reindexing them after the fact. That will be handled outside of this issue.

I am reindexing these extra Borealis records now, so the numbers showing on the collection page should be growing. (But I'm going through them slowly, sleeping between datasets, to be gentle on Solr.)
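The "slowly, sleeping between datasets" approach could be sketched as follows. The reindex call itself is abstracted into a callable, since the exact admin endpoint (something like /api/admin/index/datasets/{id} run against localhost) is an assumption here:

```python
import time

def reindex_gently(dataset_ids, reindex_one, pause_seconds=2.0, sleep=time.sleep):
    """Reindex datasets one at a time, pausing between calls so Solr is
    not flooded with back-to-back indexing requests.

    `reindex_one` should wrap whatever admin reindex call the
    installation provides (the endpoint name above is an assumption)."""
    done = 0
    for dataset_id in dataset_ids:
        reindex_one(dataset_id)
        done += 1
        sleep(pause_seconds)
    return done
```

Passing `sleep` and `reindex_one` in as parameters also makes the pacing logic trivially testable without touching a real Solr instance.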

@landreev
Collaborator

I stopped it after 4K datasets. But we will get it reindexed eventually.

@landreev landreev moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Mar 14, 2024
@landreev
Collaborator

landreev commented Apr 8, 2024

The status of this effort: the actual harvesting of their records is working OK now, with a fairly decent success-to-failure ratio using the DDI format. The problem area is indexing - the harvested records do not all get indexed in real time, and therefore do not all show up in the collection. This issue is outside of the harvesting framework and has to do with the general Solr/indexing issues we are having in production (new investigation issue IQSS/dataverse#10469), but the effect is especially noticeable in this scenario - an initial harvest of a very large set (thousands of datasets, 20K+ in this case) - since it is simply no longer possible to quickly index that many datasets in a row.

I got a few thousand more of the Borealis datasets indexed, but, again, had to stop since it was clearly causing Solr issues.

@landreev landreev added the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Apr 10, 2024
@jggautier
Collaborator Author

jggautier commented Apr 16, 2024

Thanks! I think this might be the case with at least one other Dataverse installation that Harvard Dataverse harvests from.

There are 270 unique records in the "all_items_dataverse_mel" set of the client named icarda (https://data.mel.cgiar.org/oai). And the copy of the database I use says that there are 270 datasets in the collection that we harvest those records into, although that copy hasn't been updated in about a month.

But only 224 records appear in that collection in the UI and from Search API results.

@cmbz cmbz added the GREI 3 Search and Browse label Apr 19, 2024
@cmbz cmbz moved this from This Sprint 🏃‍♀️ 🏃 to Waiting ⌛ in IQSS Dataverse Project May 8, 2024
@cmbz cmbz removed the Status: Needs Input Applied to issues in need of input from someone currently unavailable label May 8, 2024
@cmbz
Collaborator

cmbz commented Jul 10, 2024

2024/07/10

@jggautier
Collaborator Author

jggautier commented Jul 11, 2024

We haven't told Harvard Dataverse to update the records it has from the Borealis repository. I think @landreev and others are continuing work on indexing improvements that will help Harvard Dataverse harvest more records - or, more specifically, ensure that those records appear on search pages - so harvesting, and updating the records that Harvard Dataverse has harvested from any repository, has been put on hold.

I think the same is true for all GitHub issues listed in IQSS/dataverse-pm#171 that are about ensuring that Harvard Dataverse is able to harvest all records from the repositories it harvests from and would like to start harvesting from.

@landreev I hope you don't mind that I defer to you about the status of that indexing work.

@cmbz
Collaborator

cmbz commented Jul 17, 2024

Assigning you both @landreev @jggautier to monitor this issue. Thanks.

@jggautier
Collaborator Author

Like other GitHub issues about Harvard Dataverse harvesting from other repositories, this issue is on hold pending work being done to improve how Dataverse harvests.
