speed up indexing #1364

Closed
mbjones opened this issue Jun 28, 2019 · 3 comments

@mbjones
Member

mbjones commented Jun 28, 2019

Indexing new content in Metacat is a bottleneck that needs to be fixed. Here's a report from Jesse G about the issues they are seeing.

I'm writing to express concern for the speed of the indexing queue. Its sluggishness continues to be an obstacle to the support team's ability to efficiently process and publish data packages. This is especially true for resource maps that aggregate 100s of data objects.

We have been putting off publishing 187,000 data files (submitted by Jeremy May) due to our concern that we'll backlog the index. Now that I am able to peek inside the queue, thanks to the command line tool that Chris whipped up for us, I notice that all obsoleted versions of an object appear to get re-indexed with even a minor change to the latest version (e.g., updating a fileName in the system metadata). This seems inefficient given that the older versions of these objects are not being changed or updated in any way.

Metacat depends on DataONE's indexer, so many of these changes would need to occur there, but I am reporting this here to ensure it gets prioritized in Metacat's release schedule.

Consider:

  • reducing the number of indexing jobs by eliminating unnecessary ones
  • making each indexing job faster
  • parallel processing multiple index jobs on a cluster with many processing nodes

We should pursue each of these, but focus on the ones that have the largest performance gains first. The big bottlenecks occur when we submit many thousands of objects in a short period, so parallel processing using a cluster that can scale to hundreds or thousands of nodes might be the highest priority, followed by eliminating unneeded index jobs, and then finally speeding up each individual job.

Goals:

  • Process any given object in < 1 s wall-clock time
  • Process up to 500 objects in parallel
  • Process each object once and only once
  • Do not queue or index objects that do not need to be reindexed
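
To make the "once and only once" and parallel-processing goals concrete, here is a minimal sketch in Python. It is not Metacat or DataONE indexer code: `CoalescingIndexQueue`, `index_all`, and the `index_one` callback are hypothetical names made up for illustration, and a thread pool in one process only stands in for the cluster of processing nodes discussed above.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class CoalescingIndexQueue:
    """Pending index tasks keyed by pid (hypothetical, for illustration).

    Submitting a pid that is already waiting is a no-op, so each object
    is indexed at most once per drain.
    """

    def __init__(self):
        self._pending = OrderedDict()

    def submit(self, pid):
        """Queue a pid; return False if it was already queued."""
        if pid in self._pending:
            return False
        self._pending[pid] = True
        return True

    def drain(self):
        """Return all pending pids in submission order and empty the queue."""
        pids = list(self._pending)
        self._pending.clear()
        return pids


def index_all(queue, index_one, max_workers=500):
    """Run index_one(pid) for every pending pid, up to max_workers at a time.

    index_one is an assumed callback standing in for the real indexing work.
    """
    pids = queue.drain()
    if not pids:
        return []
    workers = min(max_workers, len(pids))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input pids.
        return list(pool.map(index_one, pids))
```

With this shape, a burst of updates that all touch the same object collapses into a single task before any work starts, which is one way to cut down on unnecessary jobs before worrying about per-job speed.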
@jagoldstein

This issue is making work nearly impossible this afternoon. It is taking multiple hours to index a single, small data object or EML. While this persists, it is best if only one update is made at a time (possibly only one per day). It creates a really poor experience for submitters and inhibits us from hiring more interns to do processing as the system would not be able to support their work.

@amoeba
Contributor

amoeba commented Jul 30, 2019

Hey @jagoldstein that's definitely not the desired behavior here.

For me and other devs: Looking at catalina.out, it's pretty clear that part of this is related to the cascade of updates a single index task triggers. Creating and inserting Solr docs itself stays fast, ~1 insert per second, but a single index task is causing many Solr docs to get inserted. For example, this call in R triggered a lot of Solr docs to get inserted:

# Other details removed for simplicity
publish_update(mn,
  metadata_pid = "urn:uuid:2a706b99-d138-4e92-a643-7688df172804",
  resource_map_pid = "resource_map_urn:uuid:2a706b99-d138-4e92-a643-7688df172804",
  data_pids = c("urn:uuid:a4b8d5ed-a2db-4eee-b9d1-ac537595f297", 
                "urn:uuid:1bf75e5f-28f7-489d-ae5b-a09dbfff41f3", 
                "urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf"))

At face value, that's a really innocuous-looking job to index. But when I look at catalina.out, I see something suspicious:

$ grep "inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf" /var/log/tomcat7/catalina.out
metacat-index 20190729-14:57:42: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-14:57:58: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-14:58:06: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-14:58:22: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-14:58:33: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-14:58:42: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-14:58:59: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-14:59:08: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-14:59:23: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:02:34: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:02:43: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:03:01: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:03:12: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:03:21: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:03:38: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:03:51: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:2a706b99-d138-4e92-a643-7688df172804, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:04:00: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:04:11: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:04:29: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:04:40: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:04:49: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-15:05:06: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:34:52: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:36:09: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:2a706b99-d138-4e92-a643-7688df172804, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:36:18: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:36:29: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:36:46: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:36:59: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:2a706b99-d138-4e92-a643-7688df172804, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:37:08: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:37:19: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:37:36: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:37:48: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_doi:10.18739/A25X25C75, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:38:01: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:2a706b99-d138-4e92-a643-7688df172804, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:38:12: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:38:23: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:38:40: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:38:54: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:2a706b99-d138-4e92-a643-7688df172804, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:39:03: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:f37f5eeb-4ddb-4b60-9941-f2081346803b, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:39:14: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:39:31: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:cbc7a1de-f75a-496f-ac3a-25a3a0f20089, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]
metacat-index 20190729-16:39:39: [DEBUG]: SolrIndex.insert - inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, into the solr server. [edu.ucsb.nceas.metacat.index.SolrIndex:insert:435]

The above output makes it look like the Solr doc for urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf (a data file) is getting updated 41 times, and repeatedly for membership in the same ORE (resource map). For example, I see

inserted the solr-doc object of pid urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf, which relates to object resource_map_urn:uuid:c29a75e6-8724-4358-999c-e9648e6e96bb

12 times. Mostly just putting these notes in here for now. Thanks again @jagoldstein
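
For anyone who wants to repeat this tally without counting grep output by hand, here is a rough Python sketch. The regular expression is written against the log line format shown above; `tally_inserts` is a hypothetical helper name, not part of Metacat.

```python
import re
from collections import Counter

# Matches the SolrIndex.insert DEBUG message format shown in the log above.
LINE_RE = re.compile(
    r"inserted the solr-doc object of pid (?P<pid>\S+), "
    r"which relates to object (?P<related>\S+), into the solr server"
)


def tally_inserts(lines):
    """Count Solr doc inserts per (pid, related object) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("pid"), match.group("related"))] += 1
    return counts
```

Feeding it the lines of catalina.out (for example `tally_inserts(open("/var/log/tomcat7/catalina.out"))`) and calling `.most_common()` on the result should surface the pairs that are inserted repeatedly.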

@mbjones
Member Author

mbjones commented Feb 12, 2021

Increased the priority on this to critical due to the increasing number of packages we are processing with thousands of entries.

mbjones added a commit that referenced this issue Mar 1, 2022
taojing2002 modified the milestones: 3.0.0, 2.19.0 Jan 25, 2023
taojing2002 modified the milestones: 2.19.0, 3.0.0 Apr 5, 2023
artntek closed this as completed Feb 8, 2024