
C++20 coroutine based version is not as fast as the one using Boost.Fiber? #3

Closed
npuichigo opened this issue Oct 28, 2021 · 18 comments

@npuichigo

Recently I ran the grpc_bench to compare the performance of different settings. I found that the coroutine-based one is slower than both the Boost.Fiber version and the gRPC multi-threaded version. Do you have any insight into this?

@Tradias
Owner

Tradias commented Oct 28, 2021

I ran the cpp_grpc_mt_bench, cpp_asio_grpc_bench and a modified version of cpp_asio_grpc_bench that uses C++20 coroutines with GRPC_SERVER_CPUS=1 on a Windows machine:

Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, Boost 1.77, gRPC 1.37.0, asio-grpc v1.2.0

with the command:

ghz --proto=asio-grpc\example\protos\helloworld.proto --call=helloworld.Greeter.SayHello --cpus 7 --insecure --concurrency=1000 --connections=50 --duration 20s --data-file 100B.txt 127.0.0.1:50051

Boost.Coroutine:

Summary:
  Count:        423578
  Total:        20.03 s
  Slowest:      582.23 ms
  Fastest:      0 ns
  Average:      26.45 ms
  Requests/sec: 21151.68

Response time histogram:
  0.000   [635]    |
  58.223  [407360] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  116.446 [12220]  |∎
  174.670 [1388]   |
  232.893 [21]     |
  291.116 [0]      |
  349.339 [0]      |
  407.563 [0]      |
  465.786 [247]    |
  524.009 [565]    |
  582.232 [188]    |

Latency distribution:
  10 % in 6.40 ms
  25 % in 14.00 ms
  50 % in 24.00 ms
  75 % in 32.45 ms
  90 % in 44.49 ms
  95 % in 53.04 ms
  99 % in 93.99 ms

Status code distribution:
  [OK]            422624 responses
  [Canceled]      517 responses
  [Unavailable]   437 responses

Error distribution:
  [517]   rpc error: code = Canceled desc = grpc: the client connection is closing
  [437]   rpc error: code = Unavailable desc = transport is closing

C++20 coroutines:

Summary:
  Count:        415305
  Total:        20.02 s
  Slowest:      1.07 s
  Fastest:      0 ns
  Average:      27.42 ms
  Requests/sec: 20746.18

Response time histogram:
  0.000    [885]    |
  107.035  [411518] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  214.069  [1553]   |
  321.104  [1]      |
  428.139  [0]      |
  535.173  [0]      |
  642.208  [0]      |
  749.243  [0]      |
  856.278  [0]      |
  963.312  [144]    |
  1070.347 [856]    |

Latency distribution:
  10 % in 5.99 ms
  25 % in 13.00 ms
  50 % in 23.31 ms
  75 % in 32.23 ms
  90 % in 45.08 ms
  95 % in 56.09 ms
  99 % in 93.16 ms

Status code distribution:
  [OK]            414957 responses
  [Canceled]      131 responses
  [Unavailable]   217 responses

Error distribution:
  [131]   rpc error: code = Canceled desc = grpc: the client connection is closing
  [217]   rpc error: code = Unavailable desc = transport is closing

gRPC multi-threaded example:

Summary:
  Count:        413776
  Total:        20.01 s
  Slowest:      1.18 s
  Fastest:      0 ns
  Average:      27.10 ms
  Requests/sec: 20673.49

Response time histogram:
  0.000    [986]    |
  118.369  [409658] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  236.738  [1266]   |
  355.107  [0]      |
  473.477  [0]      |
  591.846  [0]      |
  710.215  [0]      |
  828.584  [0]      |
  946.953  [163]    |
  1065.322 [721]    |
  1183.692 [116]    |

Latency distribution:
  10 % in 5.89 ms
  25 % in 13.00 ms
  50 % in 23.02 ms
  75 % in 32.00 ms
  90 % in 44.00 ms
  95 % in 53.70 ms
  99 % in 93.75 ms

Status code distribution:
  [OK]            412910 responses
  [Canceled]      220 responses
  [Unavailable]   646 responses

Error distribution:
  [646]   rpc error: code = Unavailable desc = transport is closing
  [220]   rpc error: code = Canceled desc = grpc: the client connection is closing

I would say there is no significant difference.
Results on Linux might vary, as seen in the README. It is slightly disappointing, in my opinion, to see such a difference between the gRPC multi-threaded example and my library on Linux, but then again, that example is hardly scalable to real-world applications.

As for the difference between Boost.Coroutine and C++20 coroutines, I cannot say either. Boost.Asio's built-in memory recycling for C++20 coroutines is definitely active and working in my library; I think that is all I can do. We might just have to wait until compilers become better at optimizing C++20 coroutines. I still recommend using them in place of Boost.Coroutine.
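
For reference, the C++20 coroutine variant being compared has roughly the following shape (a minimal sketch based on the repository's helloworld example and the agrpc::request/agrpc::finish calls shown later in this thread, not the exact benchmark code; grpc_context and service are assumed to be the usual agrpc::GrpcContext and generated AsyncService):

boost::asio::co_spawn(
    grpc_context,
    [&]() -> boost::asio::awaitable<void>
    {
        grpc::ServerContext server_context;
        helloworld::HelloRequest request;
        grpc::ServerAsyncResponseWriter<helloworld::HelloReply> writer{&server_context};
        // Wait for an incoming SayHello RPC.
        co_await agrpc::request(&helloworld::Greeter::AsyncService::RequestSayHello, service,
                                server_context, request, writer);
        helloworld::HelloReply response;
        response.set_message(request.name());
        // Send the response and complete the RPC.
        co_await agrpc::finish(writer, response, grpc::Status::OK);
    },
    boost::asio::detached);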

I also ran the benchmarks through Intel VTune and cannot see anything immediately wrong, at least nothing seems to indicate that my implementation is causing some noticeable slowdown:
(VTune screenshots: helloworld-boost-coroutine, helloworld-grpc-mt, helloworld-cpp20-coroutine)

@npuichigo
Author

Oh, I left out some information. I can get comparable results with a single thread. But the issue I mentioned occurs when using multiple threads, maybe four.

Could you also provide a benchmark using multiple threads?

@Tradias
Owner

Tradias commented Oct 28, 2021

Sure, although the maximum my machine can saturate is 2 threads :)

@npuichigo
Author

Anyway, thanks for your work. I have recently been trying to adapt your work to libunifex, since it's a more lightweight solution. 😀

@Tradias
Owner

Tradias commented Oct 28, 2021

You're welcome. I might as well hack on libunifex a bit myself now :). It doesn't look much different from Boost.Asio, just lacking documentation.

@npuichigo
Author

Just finished a part of the adaptation here.

@Tradias
Owner

Tradias commented Oct 29, 2021

wow that is so cool!

@Tradias
Owner

Tradias commented Oct 31, 2021

I have updated the benchmarks in the README. They actually show better performance with C++20 coroutines compared to Boost.Coroutine on my Linux machine.

Also note that if you are running benchmarks for 4-CPU servers then you really need to ensure that the server is fully exhausted. E.g. on my 12-core machine, using 8 cores for the client and 4 for the server, I am unable to do so:

name                            req/s   avg. latency   90 % in    95 % in    99 % in    avg. cpu   avg. memory
go_grpc                         86840   8.74 ms        16.41 ms   19.92 ms   28.71 ms   247.3%     28.08 MiB
cpp_grpc_callback               85723   9.97 ms        15.16 ms   17.95 ms   24.42 ms   215.32%    158.87 MiB
cpp_asio_grpc_cpp20_coroutine   84986   8.52 ms        16.46 ms   22.15 ms   34.92 ms   230.18%    66.54 MiB
rust_tonic_mt                   84591   9.04 ms        17.39 ms   22.55 ms   34.38 ms   281.41%    18.87 MiB
cpp_grpc_mt                     83895   8.55 ms        16.86 ms   23.15 ms   37.01 ms   227.57%    67.24 MiB
cpp_asio_grpc_boost_coroutine   83631   8.66 ms        16.96 ms   22.62 ms   36.11 ms   231.3%     66.75 MiB
rust_grpcio                     82151   9.31 ms        17.01 ms   22.90 ms   36.57 ms   274.72%    34.45 MiB
rust_thruster_mt                79143   9.96 ms        19.58 ms   25.58 ms   37.57 ms   288.98%    15.47 MiB

Notice that the avg. cpu column should read 400% for the server to be exhausted. Since it does not, these results are completely useless.

@Tradias
Owner

Tradias commented Oct 31, 2021

For the upcoming version of asio-grpc I am planning:

  • A Boost-less version that uses only standalone Asio (already done on master)
  • Initial support for unified executors with libunifex

@npuichigo
Author

Thanks for the info. I'd like to reproduce this on my machine. By the way, will repeatedly_request improve the performance? What if there are not enough request calls to match the incoming requests?

@Tradias
Owner

Tradias commented Oct 31, 2021

Yes, good question. I couldn't find any information on "how many outstanding request calls" there should be at a time. Actually, I just tested it and it seems that if there are multiple outstanding calls to RequestXXX then all of them will handle the incoming RPC simultaneously. So I suppose that having exactly one is correct, and that is what repeatedly_request is doing. E.g. the code in the benchmark could be rewritten to avoid coroutines entirely and only rely on callbacks, which might yield better performance, but I would say that such code is hardly scalable to real-world applications:

struct ProcessRPC
{
    using executor_type = agrpc::GrpcContext::executor_type;

    agrpc::GrpcContext& grpc_context;

    auto get_executor() const noexcept { return grpc_context.get_executor(); }

    template <class RPCHandler>
    void operator()(RPCHandler&& rpc_handler, bool ok)
    {
        // ok is false when the server is shutting down.
        if (!ok)
        {
            return;
        }
        // args() holds the ServerContext, request and responder of the accepted RPC.
        auto args = rpc_handler.args();
        auto response = std::allocate_shared<test::v1::Response>(grpc_context.get_allocator());
        response->set_integer(21);
        auto& response_ref = *response;
        // Keep rpc_handler and response alive in the completion handler until finish has completed.
        agrpc::finish(std::get<2>(args), response_ref, grpc::Status::OK,
                      asio::bind_executor(this->get_executor(), [rpc_handler = std::move(rpc_handler),
                                                                 response = std::move(response)](bool) {}));
    }
};

agrpc::repeatedly_request(&test::v1::Test::AsyncService::RequestUnary, service, ProcessRPC{grpc_context});

@npuichigo
Author

npuichigo commented Nov 1, 2021

Actually, I don't know whether it's necessary to align with these lines to handle async requests.

} else if (status_ == PROCESS) {
      // Spawn a new CallData instance to serve new clients while we process
      // the one for this CallData. The instance will deallocate itself as
      // part of its FINISH state.
      new CallData(service_, cq_);

      // The actual processing.
      ...

That means the coroutine version may look something like this:

// UnaryRPCContext is a user-defined struct holding the ServerContext, request and
// ServerAsyncResponseWriter of a single RPC.
boost::asio::awaitable<void> handle_rpc(agrpc::GrpcContext& grpc_context,
                                        helloworld::Greeter::AsyncService& service) {
    auto executor = co_await boost::asio::this_coro::executor;
    auto context = std::allocate_shared<UnaryRPCContext>(grpc_context.get_allocator());
    const bool request_ok = co_await agrpc::request(
        &helloworld::Greeter::AsyncService::RequestSayHello, service,
        context->server_context, context->request, context->writer);
    if (!request_ok) {
        co_return;
    }
    // This line: spawn the next acceptor so that another RequestSayHello is already
    // outstanding while the current request is being processed.
    boost::asio::co_spawn(executor, handle_rpc(grpc_context, service), boost::asio::detached);

    helloworld::HelloReply response;
    response.set_message(context->request.name());
    auto& writer = context->writer;
    co_await agrpc::finish(writer, response, grpc::Status::OK);
}

boost::asio::co_spawn(
    grpc_context,
    [&]() -> boost::asio::awaitable<void> {
        while (true) {
            co_await handle_rpc(grpc_context, service);
        }
    },
    boost::asio::detached);

Even if the actual processing time is long, there are still enough calls to RequestXXX, each in a separate coroutine, to match the incoming requests.

@npuichigo
Author

npuichigo commented Nov 1, 2021

I reran the benchmark on my machine and the results are consistent with yours.

However, it's strange that after changing this line to co_await agrpc::finish(writer, response, grpc::Status::OK); the performance drops significantly. I think it's also related to this #3 (comment).

@npuichigo
Author

npuichigo commented Nov 1, 2021

In my experiment, after switching to co_spawn immediately after receiving the request, the co_await agrpc::finish version achieves comparable benchmark results.

@Tradias
Owner

Tradias commented Nov 1, 2021

Seems legitimate to me. I mean, that is exactly what repeatedly_request helps with: to ensure that the next call to Request is made immediately, even before handling the particular request. If you look at https://github.com/Tradias/asio-grpc#snippet-repeatedly-request-spawner you will see an example that does exactly what your change is doing, i.e. spawning another coroutine to handle the Finish while already having made another call to Request in the background.
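
A minimal sketch of that pattern, reusing only the agrpc calls already shown in this thread (the names RPCContext, handle_finish and accept_loop are illustrative, not part of asio-grpc):

// Assumes the same includes and helloworld service as the snippets above.
struct RPCContext
{
    grpc::ServerContext server_context;
    helloworld::HelloRequest request;
    grpc::ServerAsyncResponseWriter<helloworld::HelloReply> writer{&server_context};
};

boost::asio::awaitable<void> handle_finish(std::shared_ptr<RPCContext> context)
{
    helloworld::HelloReply response;
    response.set_message(context->request.name());
    // Complete this RPC; the accept loop has already issued the next Request in the meantime.
    co_await agrpc::finish(context->writer, response, grpc::Status::OK);
}

boost::asio::awaitable<void> accept_loop(agrpc::GrpcContext& grpc_context,
                                         helloworld::Greeter::AsyncService& service)
{
    while (true)
    {
        auto context = std::make_shared<RPCContext>();
        const bool request_ok = co_await agrpc::request(
            &helloworld::Greeter::AsyncService::RequestSayHello, service,
            context->server_context, context->request, context->writer);
        if (!request_ok)
        {
            co_return;
        }
        // Hand off the rest of this RPC so the loop can immediately wait for the next one.
        boost::asio::co_spawn(grpc_context, handle_finish(std::move(context)), boost::asio::detached);
    }
}

This keeps one RequestSayHello outstanding at all times while the Finish of each accepted RPC runs in its own detached coroutine, which is essentially what repeatedly_request automates.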

The performance with unifex might be slightly better since they can avoid the extra dynamic memory allocation for the OperationState and store it directly in the coroutine frame. Although, if they do not have memory recycling for the task itself like Boost.Asio does for its awaitable, it might even out.

@Tradias
Owner

Tradias commented Nov 1, 2021

I am very glad for your input and your efforts on adapting asio-grpc to libunifex. I think we should combine our efforts. I am open to pull requests, issues and ideas in general. Currently I am thinking about how to design an API for the Sender concept that works with different "backends" like Asio's set_value and libunifex's set_value.
There is a CONTRIBUTING guideline that should get you started, as well as a CMakePresets.json that your IDE might find helpful.

@npuichigo
Author

Actually, I have been following the recent proposals on how executors/networking will land in the C++ standard. I put some effort into adapting to libunifex for a more lightweight solution, since Asio is not only about executors (maybe because I'm not so familiar with Asio 😀).

Now that you have already extended asio-grpc with libunifex support, combining our efforts seems like a good way forward, and I am also willing to contribute towards an easier-to-use async API. I will take a look at the recent updates and think about it.

For the rest, since you mentioned before that we need some codegen to make things easier for users, my first thought is to add a plugin like this to provide some dummy code. Then I'd like to refer to the C# gRPC API for the design.

@npuichigo
Author

Closing as the issue is addressed now. Hope we can have further discussion in the future.
