
Avoid std::future in S3 multipart upload #2851

Conversation

Shelnutt2 (Member)

This solves a deadlock that can occur when all cores get blocked on waiting for the std::future get() call leading to thread starvation.


TYPE: BUG
DESC: Avoid thread starvation by removing std::future usage in S3 multipart upload

@shortcut-integration

This pull request has been linked to Shortcut Story #13742: Eliminate blocking std::future<>::get() in s3.cc.

@lums658 (Contributor) left a comment

This removes the calls into S3's threading/asynchrony system and stays within TileDB. I think it may offer better concurrency as well as avoiding the deadlock.

@dhoke4tdb (Contributor) left a comment

Seems to me like these changes are still subject to the same potential failure as the original.

presumption - the vfs_thread_pool_ is the same thread_pool (already) running the submitting tasks (true/false?)

Assuming true...

parallel_for() places the tasks into the vfs_thread_pool_ task queue (which, in the original scenario, already had all thread resources occupied by the submitting tasks), and then does a tp->wait_all(tasks) on all of those (S3 activity) tasks.

Apart from perhaps having slightly different 'jitter' from the original, this approach still seems subject to having all available vfs_thread_pool_ threads occupied by the invoking/submitting tasks, and hence no resource available to run the S3 "jobs" submitted in parallel_for() (because they are already blocked behind the submitting task threads).

Am I missing something?

@lums658 (Contributor) commented Jan 31, 2022

The difference is that wait_all will make progress on pending tasks rather than blocking in get().

@dhoke4tdb (Contributor)

In the minus-one (-1) threadpool, yes, but is the new threadpool changing to the minus-one (-1) approach?

And, in discussions, I think it's been agreed among some that the problem -could- be seen under unfortunate circumstances even with the minus-one (-1) design...

@dhoke4tdb (Contributor)

I extracted the changes in this branch and applied them on top of a branch with the new threadpool and am still seeing a hang in that.

It doesn't look quite like the old one; it seems to have only two threads 'actively' involved, but I think it's still a similar hang. (I didn't quite understand the 3 threads in the earlier version, so this actually matches more of what I would have expected.)

@lums658 (Contributor) commented Feb 1, 2022

There is an actual deadlock here. S3::write_multipart() locks state->mtx. While holding that, it calls S3::get_make_upload_part_req(), which then tries to lock state->mtx. I suggest unlocking right before calling the parallel_for (and lock again before updating state->part_number).

(So far on testing this seems to be working, but I am going to let the testing run for a while longer.)

Would it be better to make state->part_number an atomic and not lock so much?

@Shelnutt2 Shelnutt2 force-pushed the sethshelnutt/sc-13742/eliminate-blocking-std-future-get-in-s3-cc branch from 99d8620 to 9bdbb56 Compare February 1, 2022 04:01
@dhoke4tdb (Contributor) commented Feb 1, 2022

I'm now getting a SEGV: Debug config build, WSL *nix, commit 9bdbb56, with the previous repro scenario (concurrency_level==2). 3 attempts all seem to fail similarly:

```
#2 0x00005555561aa0b6 in Aws::Utils::Outcome<Aws::S3::Model::UploadPartResult, Aws::S3::S3Error>::operator= (
    this=0x5555647dd730, o=...)
    at /mnt/d/dev/tiledb/gh.lums-gdb-thdpool.git/bld.dbg.ubu2004.F2/externals/install/include/aws/core/utils/Outcome.h:1

 116    Outcome& operator=(Outcome&& o)
 117    {
 118        if (this != &o)
 119        {
>120            result = std::move(o.result);
 121            error = std::move(o.error);
 122            success = o.success;
 123        }
 124
```

@Shelnutt2 Shelnutt2 force-pushed the sethshelnutt/sc-13742/eliminate-blocking-std-future-get-in-s3-cc branch 4 times, most recently from 04cea0c to ffa297b Compare February 2, 2022 15:58
@Shelnutt2 Shelnutt2 force-pushed the sethshelnutt/sc-13742/eliminate-blocking-std-future-get-in-s3-cc branch from ffa297b to 7d02d21 Compare February 4, 2022 21:44
@Shelnutt2 Shelnutt2 merged commit 03e3af6 into dev Feb 9, 2022
@Shelnutt2 Shelnutt2 deleted the sethshelnutt/sc-13742/eliminate-blocking-std-future-get-in-s3-cc branch February 9, 2022 12:46