
Bulk closing many objects may cause dor-services-app to crash #4514

Open
andrewjbtw opened this issue May 22, 2023 · 9 comments

@andrewjbtw

Describe the problem

I've been bulk updating objects in batches between 10,000 and 50,000 druids in size. The end of this process involves running the bulk action for closing objects. I've noticed that when I run bulk action close on a batch of 50,000, almost inevitably there will be a problem with dor-services-app at some point in the process.

Usually, this shows up as Honeybadger errors in common-accessioning that just say "unable to reach dor-services-app", like these:
https://app.honeybadger.io/projects/52894/faults/86609367
https://app.honeybadger.io/projects/52894/faults/95143946

The bulk action will still complete but then I have to follow up by going through the workflow grid and addressing whatever errors occurred when dor-services-app was "unreachable." I've been trying to run the close actions at night in order to avoid too much disruption to regular accessioning.

Additional context

In the Fedora era, there were similar problems when running a bulk close, but they were triggered by much smaller batches (fewer than 5,000 druids) and cleanup afterwards was much more difficult. So the system has improved considerably, but a sustained high load of accessioning still appears to cause disruptions.

@andrewjbtw
Author

I've been thinking about how to reproduce this. I still have many batches of 50,000 items to update, so if someone is able to investigate, we can coordinate to see how a few jobs could be monitored. To avoid disrupting other users, I've been running bulk actions after hours, so the errors have been happening mainly at night.

@peetucket
Member

My read of the code suggests that the CloseVersionJob in Argo just fires off requests to DSA (via dor-services-client). Presumably, if the jobs run quickly, this is essentially a DoS attack on DSA, which just can't handle the high concurrent load as it tries to close each object (closing involves calling the workflow service in addition to writing to the events table).

It's unclear how to deal with this (perhaps something in Sidekiq to throttle job execution?).
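
For illustration, here is a minimal sketch of that idea, assuming the bulk close were split into one Sidekiq job per druid and the third-party sidekiq-throttled gem were added; the class name and concurrency limit are hypothetical, and the close call assumes the dor-services-client object/version API:

# Hypothetical: cap how many close jobs run concurrently using sidekiq-throttled,
# so a 50,000-druid bulk action can't flood DSA with simultaneous requests.
require 'sidekiq'
require 'sidekiq/throttled'
require 'dor/services/client'

class ThrottledCloseVersionJob
  include Sidekiq::Job
  include Sidekiq::Throttled::Job

  # At most 5 of these jobs run at once across all Sidekiq processes.
  sidekiq_throttle(concurrency: { limit: 5 })

  def perform(druid)
    # One close-version request to DSA per druid (client assumed to be configured).
    Dor::Services::Client.object(druid).version.close
  end
end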

@mjgiarlo
Member

mjgiarlo commented May 25, 2023

There are many possible changes we could make to deal with an apparent DoS, and I'm wondering what the specific failure mode is on the DSA side. So I agree with everything ☝🏻 re: the importance of reproducing this. It'd be good to know the root problem first. E.g.:

  • Is one of the DSA VMs down?
  • Or multiple of them?
  • And/or the load-balancer?
  • If the VMs are responsive, did Apache/Passenger run out of connections?
  • Or the database?
  • Or did DSA have a problem with its connection to the workflow service?

Based on what we learn, we could consider adding nodes to the load-balancer or the database cluster, or look into robustifying the Apache and/or Passenger configuration to allow more connections, etc. We could also consider changes to the code on either or both the Argo and DSA sides: the Argo job could send over more information in bulk; the DSA work could be made async; we could consider using messaging instead of synchronous HTTP API calls; etc. There are many ways to address this, whatever "this" is. 😄
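
As a rough illustration of one of those options (making the DSA-side close asynchronous), the endpoint could acknowledge the request immediately and hand the actual work to a background job; all class and method names below are hypothetical, not DSA's actual code:

# Hypothetical sketch of an async close endpoint on the DSA side. The controller
# returns 202 right away; a background job does the workflow calls and event
# writes later, smoothing out spikes from large bulk actions.
class VersionsController < ApplicationController
  def close
    CloseVersionBackgroundJob.perform_later(params[:druid])
    head :accepted
  end
end

class CloseVersionBackgroundJob < ApplicationJob
  queue_as :default

  def perform(druid)
    # Placeholder for the real close logic (workflow service calls, events table
    # writes, etc.); the actual service call and signature are assumed.
    VersionService.close(druid)
  end
end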

@justinlittman
Contributor

While the close version bulk action involves a lot of activity, it should all be serial / synchronous, so this is a bit surprising. I agree with @mjgiarlo that we need to reproduce this to understand how best to address the problem.

Note that recent changes to DSA VersionService (21d760e) make closes more efficient by not requiring the cocina object to be loaded.
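
For context, the serial pattern described above amounts to something like this simplified sketch (assuming the dor-services-client API and an already-configured client; the real bulk-action job also records per-druid outcomes):

# Simplified sketch of a serial bulk close: one synchronous DSA request per druid,
# so only one close should be in flight at a time for a given bulk action.
require 'dor/services/client'

druids = File.readlines('druids.txt', chomp: true) # hypothetical input list

druids.each do |druid|
  Dor::Services::Client.object(druid).version.close
rescue StandardError => e
  # Record the failure and continue with the remaining druids.
  warn "Failed to close #{druid}: #{e.message}"
end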

@ndushay
Contributor

ndushay commented Sep 12, 2023

Andrew says he has a test object for this; it is an object with tons of small files. (?)

@andrewjbtw
Author

Not quite that easy. Testing this needs thousands of objects, and the changes being made must be small so that accessioning runs at a high rate.

@justinlittman
Contributor

A few observations:

  • Each call to the DSA close-version endpoint may result in multiple calls to Workflow.
  • DSA has 2 web servers (each with 20 Passenger processes, backed by 12 CPUs and 8 GB RAM).
  • Workflow has a single web server (with 100 Passenger processes, backed by 8 CPUs and 16 GB RAM).

Noting:

  • 100 Passenger processes on Workflow seems crazy pants. That's only ~160 MB of RAM per process (16 GB / 100).
  • Workflow should probably have multiple web servers.

@justinlittman self-assigned this Sep 28, 2023
@justinlittman
Contributor

To bulk create objects:

# Install the SDR client and log in to the stage API
gem install sdr-client
export SDR_API=https://sdr-api-stage.stanford.edu
sdr login --url "$SDR_API"

# Create a trivial content file to deposit repeatedly
echo "1" > test.txt

# Deposit 5,000 small objects under the test APO, each with a unique source ID
export APO=druid:zm491tx1704
for i in {1..5000}
do
  sdr deposit test.txt --url "$SDR_API" --label "sample deposit" --admin-policy "$APO" --source-id "jlit-test:$(uuidgen)" --type "object" --view "world"
done

@justinlittman
Contributor

I was unable to reproduce this in stage:
[screenshot: private ZenHub image]

Given that (1) there have been multiple changes to SDR/DSA since the original problem and (2) there are significant differences between stage and production, it is hard to know what to conclude from this.

@andrewjbtw How would you like to proceed?
