speed up indexing #1364
This issue is making work nearly impossible this afternoon. It is taking multiple hours to index a single, small data object or EML record. While this persists, it is best if only one update is made at a time (possibly only one per day). It creates a really poor experience for users.
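A minimal workaround sketch of "only one update at a time" (the helper name `submit_one_update()` and the 60-second pause are placeholders, not a tested recipe):

```r
# Hypothetical throttle: submit updates strictly one at a time, pausing
# between submissions so the indexer can drain before the next task.
# submit_one_update() stands in for whatever call performs the update.
for (pkg in packages_to_update) {
  submit_one_update(pkg)
  Sys.sleep(60)  # arbitrary pause; tune to the indexing latency you observe
}
```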
Hey @jagoldstein, that's definitely not the desired behavior here. For me and the other devs: looking at catalina.out, it's pretty clear that part of this is related to the cascade of updates a single index task triggers. Creating and inserting Solr docs itself stays fast, about one insert per second, but a single index task is causing many Solr docs to be inserted. For example, this call in R triggered a large number of Solr doc inserts:

```r
# Other details removed for simplicity
publish_update(mn,
               metadata_pid = "urn:uuid:2a706b99-d138-4e92-a643-7688df172804",
               resource_map_pid = "resource_map_urn:uuid:2a706b99-d138-4e92-a643-7688df172804",
               data_pids = c("urn:uuid:a4b8d5ed-a2db-4eee-b9d1-ac537595f297",
                             "urn:uuid:1bf75e5f-28f7-489d-ae5b-a09dbfff41f3",
                             "urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf"))
```

At face value, that's a really innocuous-looking job to index. But when I look at catalina.out, I see something suspicious:
The above output makes it look like the Solr doc for the same object is getting inserted 12 times. Mostly just putting these notes in here for now. Thanks again @jagoldstein.
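For anyone reproducing this, one rough way to quantify the duplication (a sketch only; the log path is an assumption, and matching on the bare PID will also catch unrelated log lines, not just inserts):

```r
# Rough duplicate-insert check: count lines in catalina.out mentioning a PID.
# The log path below is an assumption; adjust it for your Tomcat install.
log_lines <- readLines("/var/log/tomcat/catalina.out")
pid <- "urn:uuid:2a706b99-d138-4e92-a643-7688df172804"
sum(grepl(pid, log_lines, fixed = TRUE))  # counts every mention of the PID
```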
Increased the priority on this to critical due to the increasing number of packages we are processing that have thousands of entries.
Indexing new content in Metacat is a bottleneck that needs to be fixed. Here's a report from Jesse G about the issues they are seeing.
Metacat depends on DataONE's indexer, so many of these changes would need to occur there, but I am reporting this here to ensure it gets prioritized in Metacat's release schedule.
Consider:

- Parallelizing index processing across a cluster that can scale out
- Eliminating unneeded or duplicate index jobs
- Speeding up each individual index job
We should pursue each of these, but focus on the ones that have the largest performance gains first. The big bottlenecks occur when we submit many thousands of objects in a short period, so parallel processing using a cluster that can scale to hundreds or thousands of nodes might be the highest priority, followed by eliminating unneeded index jobs, and then finally speeding up each individual job.
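To make the "eliminate unneeded index jobs" item concrete, here is a hedged sketch (the task-queue layout below is purely an illustration, not the DataONE indexer's actual data model): coalescing repeated tasks for the same PID so only the newest survives would collapse the 12x reinsert pattern seen above into a single insert per object.

```r
# Illustrative only: coalesce queued index tasks so each PID is indexed
# at most once, keeping the most recently enqueued task per PID.
tasks <- data.frame(
  pid = c("urn:uuid:aaa", "urn:uuid:bbb", "urn:uuid:aaa", "urn:uuid:aaa"),
  enqueued_at = as.POSIXct(c("2019-06-01 10:00:00", "2019-06-01 10:00:05",
                             "2019-06-01 10:00:10", "2019-06-01 10:00:20")),
  stringsAsFactors = FALSE
)

# Sort newest first, then keep the first occurrence of each PID
tasks <- tasks[order(tasks$enqueued_at, decreasing = TRUE), ]
deduped <- tasks[!duplicated(tasks$pid), ]
deduped  # one row per PID: three tasks for urn:uuid:aaa become one
```

In the real indexer this coalescing would happen at enqueue time, so redundant tasks never reach Solr at all.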
Goals: