
Avoid std::future in S3 multipart upload #2851

Conversation

Shelnutt2 (Member)

This solves a deadlock that can occur when all cores get blocked on waiting for the std::future get() call leading to thread starvation.


TYPE: BUG
DESC: Avoid thread starvation by removing std::future usage in S3 multipart upload

@shortcut-integration

This pull request has been linked to Shortcut Story #13742: Eliminate blocking std::future<>::get() in s3.cc.

@lums658 (Contributor) left a comment

This removes the calls into S3's threading/asynchrony system and stays within TileDB. I think it may offer better concurrency as well as avoiding the deadlock.

@dhoke4tdb (Contributor) left a comment

Seems to me like these changes are still subject to the same potential failure as the original.

presumption - the vfs_thread_pool_ is the same thread_pool (already) running the submitting tasks (true/false?)

Assuming true...

parallel_for() places the tasks into the vfs_thread_pool_ task queue (which, in the original scenario, already had all thread resources occupied by the submitting tasks), and then does a tp->wait_all(tasks) on all of those (S3 activity) tasks.

Apart from perhaps having slightly different 'jitter' from the original, this approach still seems subject to having all available vfs_thread_pool_ threads occupied by the invoking/submitting tasks, and hence no resource available to run the S3 "jobs" submitted in parallel_for() (because they are already blocked behind the submitting task threads).

Am I missing something?

@lums658 (Contributor) commented Jan 31, 2022

The difference is that wait_all will make progress on pending tasks rather than blocking in get().

@dhoke4tdb (Contributor)

In the minus-one (-1) threadpool, yes, but is the new threadpool changing to the minus-one (-1) approach?

And, in discussions, I think it's been agreed among some that the problem -could- be seen under unfortunate circumstances even with the minus-one (-1) design...

@dhoke4tdb (Contributor)

I extracted the changes in this branch and applied them on top of a branch with the new threadpool and am still seeing a hang in that.

It doesn't look quite like the old one; it seems to have only two threads 'actively' involved, but I think it's still a similar hang. (I didn't quite understand the 3 threads in the earlier version, so this actually matches more of what I would have expected.)

@lums658 (Contributor) commented Feb 1, 2022

There is an actual deadlock here. S3::write_multipart() locks state->mtx. While holding that, it calls S3::get_make_upload_part_req(), which then tries to lock state->mtx. I suggest unlocking right before calling the parallel_for (and lock again before updating state->part_number).

(So far on testing this seems to be working, but I am going to let the testing run for a while longer.)

Would it be better to make state->part_number an atomic and not lock so much?

@Shelnutt2 Shelnutt2 force-pushed the sethshelnutt/sc-13742/eliminate-blocking-std-future-get-in-s3-cc branch from 99d8620 to 9bdbb56 Compare February 1, 2022 04:01
@dhoke4tdb (Contributor) commented Feb 1, 2022

I'm now getting a SEGV: Debug config build, WSL *nix, commit 9bdbb56, with the previous repro scenario (concurrency_level==2). 3 attempts all seem to fail similarly:

```
#2 0x00005555561aa0b6 in Aws::Utils::Outcome<Aws::S3::Model::UploadPartResult, Aws::S3::S3Error>::operator= (
    this=0x5555647dd730, o=...)
    at /mnt/d/dev/tiledb/gh.lums-gdb-thdpool.git/bld.dbg.ubu2004.F2/externals/install/include/aws/core/utils/Outcome.h:1

 116    Outcome& operator=(Outcome&& o)
 117    {
 118        if (this != &o)
 119        {
>120            result = std::move(o.result);
 121            error = std::move(o.error);
 122            success = o.success;
 123        }
 124
```

@Shelnutt2 Shelnutt2 force-pushed the sethshelnutt/sc-13742/eliminate-blocking-std-future-get-in-s3-cc branch 4 times, most recently from 04cea0c to ffa297b Compare February 2, 2022 15:58
@Shelnutt2 Shelnutt2 force-pushed the sethshelnutt/sc-13742/eliminate-blocking-std-future-get-in-s3-cc branch from ffa297b to 7d02d21 Compare February 4, 2022 21:44
@Shelnutt2 Shelnutt2 merged commit 03e3af6 into dev Feb 9, 2022
@Shelnutt2 Shelnutt2 deleted the sethshelnutt/sc-13742/eliminate-blocking-std-future-get-in-s3-cc branch February 9, 2022 12:46