speed up indexing #1364
This issue is making work nearly impossible this afternoon. It is taking multiple hours to index a single, small data object or EML record. While this persists, it is best if only one update is made at a time (possibly only one per day). It creates a really poor experience for users.
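A minimal workaround sketch of "only one update at a time" (the helper name `submit_one_update()` and the 60-second pause are placeholders, not a tested recipe):

```r
# Hypothetical throttle: submit updates strictly one at a time, pausing
# between submissions so the indexer can drain before the next task.
# submit_one_update() stands in for whatever call performs the update.
for (pkg in packages_to_update) {
  submit_one_update(pkg)
  Sys.sleep(60)  # arbitrary pause; tune to the indexing latency you observe
}
```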
Hey @jagoldstein, that's definitely not the desired behavior here. For me and the other devs: looking at catalina.out, it's pretty clear that part of this is related to the cascade of updates a single index task triggers. Creating and inserting Solr docs itself stays fast, about one insert per second, but a single index task is causing many Solr docs to be inserted. For example, this call in R triggered a large number of Solr doc inserts:

```r
# Other details removed for simplicity
publish_update(mn,
               metadata_pid = "urn:uuid:2a706b99-d138-4e92-a643-7688df172804",
               resource_map_pid = "resource_map_urn:uuid:2a706b99-d138-4e92-a643-7688df172804",
               data_pids = c("urn:uuid:a4b8d5ed-a2db-4eee-b9d1-ac537595f297",
                             "urn:uuid:1bf75e5f-28f7-489d-ae5b-a09dbfff41f3",
                             "urn:uuid:b550509d-4442-4688-9ccf-adbf241277bf"))
```

At face value, that's a really innocuous-looking job to index. But when I look at catalina.out, I see something suspicious:
The above output makes it look like the Solr doc for the same object is getting inserted 12 times. Mostly just putting these notes in here for now. Thanks again @jagoldstein.
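For anyone reproducing this, one rough way to quantify the duplication (a sketch only; the log path is an assumption, and matching on the bare PID will also catch unrelated log lines, not just inserts):

```r
# Rough duplicate-insert check: count lines in catalina.out mentioning a PID.
# The log path below is an assumption; adjust it for your Tomcat install.
log_lines <- readLines("/var/log/tomcat/catalina.out")
pid <- "urn:uuid:2a706b99-d138-4e92-a643-7688df172804"
sum(grepl(pid, log_lines, fixed = TRUE))  # counts every mention of the PID
```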
Increased the priority on this to critical due to the increasing number of packages we are processing that have thousands of entries.
Indexing new content in Metacat is a bottleneck that needs to be fixed. Here's a report from Jesse G about the issues they are seeing.
Metacat depends on DataONE's indexer, so many of these changes would need to occur there, but I am reporting this here to ensure it gets prioritized in Metacat's release schedule.
Consider:

- Parallelizing index processing across a cluster that can scale out
- Eliminating unneeded or duplicate index jobs
- Speeding up each individual index job
We should pursue each of these, but focus on the ones that have the largest performance gains first. The big bottlenecks occur when we submit many thousands of objects in a short period, so parallel processing using a cluster that can scale to hundreds or thousands of nodes might be the highest priority, followed by eliminating unneeded index jobs, and then finally speeding up each individual job.
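To make the "eliminate unneeded index jobs" item concrete, here is a hedged sketch (the task-queue layout below is purely an illustration, not the DataONE indexer's actual data model): coalescing repeated tasks for the same PID so only the newest survives would collapse the 12x reinsert pattern seen above into a single insert per object.

```r
# Illustrative only: coalesce queued index tasks so each PID is indexed
# at most once, keeping the most recently enqueued task per PID.
tasks <- data.frame(
  pid = c("urn:uuid:aaa", "urn:uuid:bbb", "urn:uuid:aaa", "urn:uuid:aaa"),
  enqueued_at = as.POSIXct(c("2019-06-01 10:00:00", "2019-06-01 10:00:05",
                             "2019-06-01 10:00:10", "2019-06-01 10:00:20")),
  stringsAsFactors = FALSE
)

# Sort newest first, then keep the first occurrence of each PID
tasks <- tasks[order(tasks$enqueued_at, decreasing = TRUE), ]
deduped <- tasks[!duplicated(tasks$pid), ]
deduped  # one row per PID: three tasks for urn:uuid:aaa become one
```

In the real indexer this coalescing would happen at enqueue time, so redundant tasks never reach Solr at all.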
Goals: