
Bulk closing many objects may cause dor-services-app to crash #4514

Open
andrewjbtw opened this issue May 22, 2023 · 9 comments

@andrewjbtw

Describe the problem

I've been bulk updating objects in batches between 10,000 and 50,000 druids in size. The end of this process involves running the bulk action for closing objects. I've noticed that when I run bulk action close on a batch of 50,000, almost inevitably there will be a problem with dor-services-app at some point in the process.

Usually, this shows up as Honeybadger errors in common-accessioning that just say "unable to reach dor-services-app", like these:
https://app.honeybadger.io/projects/52894/faults/86609367
https://app.honeybadger.io/projects/52894/faults/95143946

The bulk action will still complete but then I have to follow up by going through the workflow grid and addressing whatever errors occurred when dor-services-app was "unreachable." I've been trying to run the close actions at night in order to avoid too much disruption to regular accessioning.

Additional context

In the Fedora era, there were similar problems when running a bulk close, but they were triggered by much smaller batches (fewer than 5,000 druids) and cleanup afterwards was much more difficult. So the system has improved considerably, but a sustained high load of accessioning still appears to cause disruptions.

@andrewjbtw
Author

I've been thinking about how to reproduce this. I still have many batches of 50,000 items to update, so if someone is able to investigate, we can coordinate to see how a few jobs could be monitored. To avoid disrupting other users, I've been running bulk actions after hours, so the errors have been happening mainly at night.

@peetucket
Member

My read of the code suggests that the CloseVersionJob in Argo just fires off requests to DSA (via dor-services-client). Presumably, if the jobs run quickly, this is essentially a DoS attack on DSA, which just can't handle the high concurrent load as it tries to close each object (closing involves calling the workflow service in addition to writing to the events table).

It's unclear how to deal with this (perhaps something in Sidekiq to throttle job execution?).
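
For illustration, here is a minimal sketch of that idea, assuming the bulk close were split into one Sidekiq job per druid and the third-party sidekiq-throttled gem were added; the class name and concurrency limit are hypothetical, and the close call assumes the dor-services-client object/version API:

# Hypothetical: cap how many close jobs run concurrently using sidekiq-throttled,
# so a 50,000-druid bulk action can't flood DSA with simultaneous requests.
require 'sidekiq'
require 'sidekiq/throttled'
require 'dor/services/client'

class ThrottledCloseVersionJob
  include Sidekiq::Job
  include Sidekiq::Throttled::Job

  # At most 5 of these jobs run at once across all Sidekiq processes.
  sidekiq_throttle(concurrency: { limit: 5 })

  def perform(druid)
    # One close-version request to DSA per druid (client assumed to be configured).
    Dor::Services::Client.object(druid).version.close
  end
end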

@mjgiarlo
Member

mjgiarlo commented May 25, 2023

There are many possible changes we could make to deal with an apparent DoS, and I'm wondering what the specific failure mode is on the DSA side. So I agree with everything ☝🏻 re: the importance of reproducing this. It'd be good to know the root problem first. E.g.:

  • Is one of the DSA VMs down?
  • Or multiple of them?
  • And/or the load-balancer?
  • If the VMs are responsive, did Apache/Passenger run out of connections?
  • Or the database?
  • Or did DSA have a problem with its connection to the workflow service?

Based on what we learn, we could consider adding nodes to the load-balancer or the database cluster, or look into robustifying the Apache and/or Passenger configuration to allow more connections, etc. We could also consider changes to the code on either or both the Argo and DSA sides: the Argo job could send over more information in bulk; the DSA work could be made async; we could consider using messaging instead of synchronous HTTP API calls; etc. There are many ways to address this, whatever "this" is. 😄
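
As a rough illustration of one of those options (making the DSA-side close asynchronous), the endpoint could acknowledge the request immediately and hand the actual work to a background job; all class and method names below are hypothetical, not DSA's actual code:

# Hypothetical sketch of an async close endpoint on the DSA side. The controller
# returns 202 right away; a background job does the workflow calls and event
# writes later, smoothing out spikes from large bulk actions.
class VersionsController < ApplicationController
  def close
    CloseVersionBackgroundJob.perform_later(params[:druid])
    head :accepted
  end
end

class CloseVersionBackgroundJob < ApplicationJob
  queue_as :default

  def perform(druid)
    # Placeholder for the real close logic (workflow service calls, events table
    # writes, etc.); the actual service call and signature are assumed.
    VersionService.close(druid)
  end
end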

@justinlittman
Contributor

While the close version bulk action involves a lot of activity, it should all be serial / synchronous, so this is a bit surprising. I agree with @mjgiarlo that we need to reproduce this to understand how best to address the problem.

Note that recent changes to DSA VersionService (21d760e) make closes more efficient by not requiring the cocina object to be loaded.
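
For context, the serial pattern described above amounts to something like this simplified sketch (assuming the dor-services-client API and an already-configured client; the real bulk-action job also records per-druid outcomes):

# Simplified sketch of a serial bulk close: one synchronous DSA request per druid,
# so only one close should be in flight at a time for a given bulk action.
require 'dor/services/client'

druids = File.readlines('druids.txt', chomp: true) # hypothetical input list

druids.each do |druid|
  Dor::Services::Client.object(druid).version.close
rescue StandardError => e
  # Record the failure and continue with the remaining druids.
  warn "Failed to close #{druid}: #{e.message}"
end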

@ndushay
Contributor

ndushay commented Sep 12, 2023

Andrew says he has a test object for this; it is an object with tons of small files. (?)

@andrewjbtw
Author

Not quite that easy. Testing this needs thousands of objects, and the changes being made must be small so that accessioning runs at a high rate.

@justinlittman
Contributor

A few observations:

  • Each call to the DSA close-version endpoint may result in multiple calls to Workflow.
  • DSA has 2 web servers (each with 20 Passenger processes, backed by 12 CPUs and 8 GB RAM).
  • Workflow has a single web server (with 100 Passenger processes, backed by 8 CPUs and 16 GB RAM).

Noting:

  • 100 Passenger processes on Workflow seems crazy pants. That's only ~160 MB of RAM per process (16 GB / 100).
  • Workflow should probably have multiple web servers.

@justinlittman self-assigned this Sep 28, 2023
@justinlittman
Contributor

To bulk create objects:

# Install the SDR client and log in to the stage API
gem install sdr-client
export SDR_API=https://sdr-api-stage.stanford.edu
sdr login --url "$SDR_API"

# Create a trivial content file to deposit repeatedly
echo "1" > test.txt

# Deposit 5,000 small objects under the test APO, each with a unique source ID
export APO=druid:zm491tx1704
for i in {1..5000}
do
  sdr deposit test.txt --url "$SDR_API" --label "sample deposit" --admin-policy "$APO" --source-id "jlit-test:$(uuidgen)" --type "object" --view "world"
done

@justinlittman
Contributor

I was unable to reproduce this in stage:
[screenshot: private ZenHub image]

Given that (1) there have been multiple changes to SDR/DSA since the original problem and (2) there are significant differences between stage and production, it is hard to know what to conclude from this.

@andrewjbtw How would you like to proceed?
