
C++20 coroutine based version is not as fast as the one using Boost.Fiber? #3

Closed
npuichigo opened this issue Oct 28, 2021 · 18 comments

@npuichigo

Recently I ran the grpc_bench to compare the performance of different settings. I found that the coroutine-based one is slower than both the Boost.Fiber version and the gRPC multi-threaded version. Do you have any insight into this?

@Tradias
Owner

Tradias commented Oct 28, 2021

I ran the cpp_grpc_mt_bench, cpp_asio_grpc_bench and a modified version of cpp_asio_grpc_bench that uses C++20 coroutines with GRPC_SERVER_CPUS=1 on a Windows machine:

Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, Boost 1.77, gRPC 1.37.0, asio-grpc v1.2.0

with the command:

ghz --proto=asio-grpc\example\protos\helloworld.proto --call=helloworld.Greeter.SayHello --cpus 7 --insecure --concurrency=1000 --connections=50 --duration 20s --data-file 100B.txt 127.0.0.1:50051

Boost.Coroutine:

Summary:
  Count:        423578
  Total:        20.03 s
  Slowest:      582.23 ms
  Fastest:      0 ns
  Average:      26.45 ms
  Requests/sec: 21151.68

Response time histogram:
  0.000   [635]    |
  58.223  [407360] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  116.446 [12220]  |∎
  174.670 [1388]   |
  232.893 [21]     |
  291.116 [0]      |
  349.339 [0]      |
  407.563 [0]      |
  465.786 [247]    |
  524.009 [565]    |
  582.232 [188]    |

Latency distribution:
  10 % in 6.40 ms
  25 % in 14.00 ms
  50 % in 24.00 ms
  75 % in 32.45 ms
  90 % in 44.49 ms
  95 % in 53.04 ms
  99 % in 93.99 ms

Status code distribution:
  [OK]            422624 responses
  [Canceled]      517 responses
  [Unavailable]   437 responses

Error distribution:
  [517]   rpc error: code = Canceled desc = grpc: the client connection is closing
  [437]   rpc error: code = Unavailable desc = transport is closing

C++20 coroutines:

Summary:
  Count:        415305
  Total:        20.02 s
  Slowest:      1.07 s
  Fastest:      0 ns
  Average:      27.42 ms
  Requests/sec: 20746.18

Response time histogram:
  0.000    [885]    |
  107.035  [411518] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  214.069  [1553]   |
  321.104  [1]      |
  428.139  [0]      |
  535.173  [0]      |
  642.208  [0]      |
  749.243  [0]      |
  856.278  [0]      |
  963.312  [144]    |
  1070.347 [856]    |

Latency distribution:
  10 % in 5.99 ms
  25 % in 13.00 ms
  50 % in 23.31 ms
  75 % in 32.23 ms
  90 % in 45.08 ms
  95 % in 56.09 ms
  99 % in 93.16 ms

Status code distribution:
  [OK]            414957 responses
  [Canceled]      131 responses
  [Unavailable]   217 responses

Error distribution:
  [131]   rpc error: code = Canceled desc = grpc: the client connection is closing
  [217]   rpc error: code = Unavailable desc = transport is closing

gRPC multi-threaded example:

Summary:
  Count:        413776
  Total:        20.01 s
  Slowest:      1.18 s
  Fastest:      0 ns
  Average:      27.10 ms
  Requests/sec: 20673.49

Response time histogram:
  0.000    [986]    |
  118.369  [409658] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  236.738  [1266]   |
  355.107  [0]      |
  473.477  [0]      |
  591.846  [0]      |
  710.215  [0]      |
  828.584  [0]      |
  946.953  [163]    |
  1065.322 [721]    |
  1183.692 [116]    |

Latency distribution:
  10 % in 5.89 ms
  25 % in 13.00 ms
  50 % in 23.02 ms
  75 % in 32.00 ms
  90 % in 44.00 ms
  95 % in 53.70 ms
  99 % in 93.75 ms

Status code distribution:
  [OK]            412910 responses
  [Canceled]      220 responses
  [Unavailable]   646 responses

Error distribution:
  [646]   rpc error: code = Unavailable desc = transport is closing
  [220]   rpc error: code = Canceled desc = grpc: the client connection is closing

I would say there is no significant difference.
Results on Linux might vary, as seen in the README. It is slightly disappointing, in my opinion, to see such a difference between the gRPC multi-threaded example and my library on Linux, but then again, that example is hardly scalable to real-world applications.

As for the difference between Boost.Coroutine and C++20 coroutines, I cannot say either. Boost.Asio's built-in memory recycling for C++20 coroutines is definitely active and working in my library; I think that is all I can do. We might just have to wait until compilers become better at optimizing C++20 coroutines. I still recommend using them in place of Boost.Coroutine.
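
For reference, the C++20 coroutine variant being compared has roughly the following shape (a minimal sketch based on the repository's helloworld example and the agrpc::request/agrpc::finish calls shown later in this thread, not the exact benchmark code; grpc_context and service are assumed to be the usual agrpc::GrpcContext and generated AsyncService):

boost::asio::co_spawn(
    grpc_context,
    [&]() -> boost::asio::awaitable<void>
    {
        grpc::ServerContext server_context;
        helloworld::HelloRequest request;
        grpc::ServerAsyncResponseWriter<helloworld::HelloReply> writer{&server_context};
        // Wait for an incoming SayHello RPC.
        co_await agrpc::request(&helloworld::Greeter::AsyncService::RequestSayHello, service,
                                server_context, request, writer);
        helloworld::HelloReply response;
        response.set_message(request.name());
        // Send the response and complete the RPC.
        co_await agrpc::finish(writer, response, grpc::Status::OK);
    },
    boost::asio::detached);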

I also ran the benchmarks through Intel VTune and cannot see anything immediately wrong, at least nothing seems to indicate that my implementation is causing some noticeable slowdown:
(VTune screenshots: helloworld-boost-coroutine, helloworld-grpc-mt, helloworld-cpp20-coroutine)

@npuichigo
Author

Oh, I left out some information. I can get comparable results with a single thread. But the issue I mentioned occurs when using multiple threads, maybe four.

Could you also provide a benchmark using multiple threads?

@Tradias
Owner

Tradias commented Oct 28, 2021

Sure, although the maximum my machine can saturate is 2 threads :)

@npuichigo
Author

Anyway, thanks for your work. I have recently been trying to adapt your work to libunifex, since it's a more lightweight solution. 😀

@Tradias
Owner

Tradias commented Oct 28, 2021

You're welcome. I might as well hack on libunifex a bit myself now :). It doesn't look much different from Boost.Asio, just lacking documentation.

@npuichigo
Author

Just finished a part of the adaptation here.

@Tradias
Owner

Tradias commented Oct 29, 2021

wow that is so cool!

@Tradias
Owner

Tradias commented Oct 31, 2021

I have updated the benchmarks in the README. They actually show better performance with C++20 coroutines compared to Boost.Coroutine on my Linux machine.

Also note that if you are running benchmarks for 4-CPU servers then you really need to ensure that the server is fully exhausted. E.g. on my 12-core machine, using 8 cores for the client and 4 for the server, I am unable to do so:

name                            req/s   avg. latency   90 % in    95 % in    99 % in    avg. cpu   avg. memory
go_grpc                         86840   8.74 ms        16.41 ms   19.92 ms   28.71 ms   247.3%     28.08 MiB
cpp_grpc_callback               85723   9.97 ms        15.16 ms   17.95 ms   24.42 ms   215.32%    158.87 MiB
cpp_asio_grpc_cpp20_coroutine   84986   8.52 ms        16.46 ms   22.15 ms   34.92 ms   230.18%    66.54 MiB
rust_tonic_mt                   84591   9.04 ms        17.39 ms   22.55 ms   34.38 ms   281.41%    18.87 MiB
cpp_grpc_mt                     83895   8.55 ms        16.86 ms   23.15 ms   37.01 ms   227.57%    67.24 MiB
cpp_asio_grpc_boost_coroutine   83631   8.66 ms        16.96 ms   22.62 ms   36.11 ms   231.3%     66.75 MiB
rust_grpcio                     82151   9.31 ms        17.01 ms   22.90 ms   36.57 ms   274.72%    34.45 MiB
rust_thruster_mt                79143   9.96 ms        19.58 ms   25.58 ms   37.57 ms   288.98%    15.47 MiB

Notice that the avg. cpu column should read 400% for the server to be exhausted. Since it does not, these results are completely useless.

@Tradias
Owner

Tradias commented Oct 31, 2021

For the upcoming version of asio-grpc I am planning:

  • A Boost-less version that uses only standalone Asio (already done on master)
  • Initial support for unified executors with libunifex

@npuichigo
Author

Thanks for the info. I'd like to reproduce this on my machine. By the way, will repeatedly_request improve the performance? What if there are not enough request calls to match the incoming requests?

@Tradias
Owner

Tradias commented Oct 31, 2021

Yes, good question. I couldn't find any information on "how many outstanding request calls" there should be at a time. Actually, I just tested it and it seems that if there are multiple outstanding calls to RequestXXX then all of them will handle the incoming RPC simultaneously. So I suppose that having exactly one is correct, and that is what repeatedly_request is doing. E.g. the code in the benchmark could be rewritten to avoid coroutines entirely and only rely on callbacks, which might yield better performance, but I would say that such code is hardly scalable to real-world applications:

struct ProcessRPC
{
    using executor_type = agrpc::GrpcContext::executor_type;

    agrpc::GrpcContext& grpc_context;

    auto get_executor() const noexcept { return grpc_context.get_executor(); }

    template <class RPCHandler>
    void operator()(RPCHandler&& rpc_handler, bool ok)
    {
        // ok is false when the server is shutting down.
        if (!ok)
        {
            return;
        }
        // args() holds the ServerContext, request and responder of the accepted RPC.
        auto args = rpc_handler.args();
        auto response = std::allocate_shared<test::v1::Response>(grpc_context.get_allocator());
        response->set_integer(21);
        auto& response_ref = *response;
        // Keep rpc_handler and response alive in the completion handler until finish has completed.
        agrpc::finish(std::get<2>(args), response_ref, grpc::Status::OK,
                      asio::bind_executor(this->get_executor(), [rpc_handler = std::move(rpc_handler),
                                                                 response = std::move(response)](bool) {}));
    }
};

agrpc::repeatedly_request(&test::v1::Test::AsyncService::RequestUnary, service, ProcessRPC{grpc_context});

@npuichigo
Author

npuichigo commented Nov 1, 2021

Actually, I don't know whether it's necessary to align with these lines to handle async requests.

} else if (status_ == PROCESS) {
      // Spawn a new CallData instance to serve new clients while we process
      // the one for this CallData. The instance will deallocate itself as
      // part of its FINISH state.
      new CallData(service_, cq_);

      // The actual processing.
      ...

That means the coroutine version may look something like this:

// UnaryRPCContext is a user-defined struct holding the ServerContext, request and
// ServerAsyncResponseWriter of a single RPC.
boost::asio::awaitable<void> handle_rpc(agrpc::GrpcContext& grpc_context,
                                        helloworld::Greeter::AsyncService& service) {
    auto executor = co_await boost::asio::this_coro::executor;
    auto context = std::allocate_shared<UnaryRPCContext>(grpc_context.get_allocator());
    const bool request_ok = co_await agrpc::request(
        &helloworld::Greeter::AsyncService::RequestSayHello, service,
        context->server_context, context->request, context->writer);
    if (!request_ok) {
        co_return;
    }
    // This line: spawn the next acceptor so that another RequestSayHello is already
    // outstanding while the current request is being processed.
    boost::asio::co_spawn(executor, handle_rpc(grpc_context, service), boost::asio::detached);

    helloworld::HelloReply response;
    response.set_message(context->request.name());
    auto& writer = context->writer;
    co_await agrpc::finish(writer, response, grpc::Status::OK);
}

boost::asio::co_spawn(
    grpc_context,
    [&]() -> boost::asio::awaitable<void> {
        while (true) {
            co_await handle_rpc(grpc_context, service);
        }
    },
    boost::asio::detached);

Even if the actual processing time is long, there are still enough calls to RequestXXX, each in a separate coroutine, to match the incoming requests.

@npuichigo
Author

npuichigo commented Nov 1, 2021

I reran the benchmark on my machine and the results are consistent with yours.

However, it's strange that after changing this line to co_await agrpc::finish(writer, response, grpc::Status::OK); the performance drops significantly. I think it's also related to this #3 (comment).

@npuichigo
Author

npuichigo commented Nov 1, 2021

In my experiment, after switching to co_spawn immediately after receiving the request, the co_await agrpc::finish version achieves comparable benchmark results.

@Tradias
Owner

Tradias commented Nov 1, 2021

Seems legitimate to me. I mean, that is exactly what repeatedly_request helps with: to ensure that the next call to Request is made immediately, even before handling the particular request. If you look at https://github.com/Tradias/asio-grpc#snippet-repeatedly-request-spawner you will see an example that does exactly what your change is doing, i.e. spawning another coroutine to handle the Finish while already having made another call to Request in the background.
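
A minimal sketch of that pattern, reusing only the agrpc calls already shown in this thread (the names RPCContext, handle_finish and accept_loop are illustrative, not part of asio-grpc):

// Assumes the same includes and helloworld service as the snippets above.
struct RPCContext
{
    grpc::ServerContext server_context;
    helloworld::HelloRequest request;
    grpc::ServerAsyncResponseWriter<helloworld::HelloReply> writer{&server_context};
};

boost::asio::awaitable<void> handle_finish(std::shared_ptr<RPCContext> context)
{
    helloworld::HelloReply response;
    response.set_message(context->request.name());
    // Complete this RPC; the accept loop has already issued the next Request in the meantime.
    co_await agrpc::finish(context->writer, response, grpc::Status::OK);
}

boost::asio::awaitable<void> accept_loop(agrpc::GrpcContext& grpc_context,
                                         helloworld::Greeter::AsyncService& service)
{
    while (true)
    {
        auto context = std::make_shared<RPCContext>();
        const bool request_ok = co_await agrpc::request(
            &helloworld::Greeter::AsyncService::RequestSayHello, service,
            context->server_context, context->request, context->writer);
        if (!request_ok)
        {
            co_return;
        }
        // Hand off the rest of this RPC so the loop can immediately wait for the next one.
        boost::asio::co_spawn(grpc_context, handle_finish(std::move(context)), boost::asio::detached);
    }
}

This keeps one RequestSayHello outstanding at all times while the Finish of each accepted RPC runs in its own detached coroutine, which is essentially what repeatedly_request automates.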

The performance with unifex might be slightly better since they can avoid the extra dynamic memory allocation for the OperationState and store it directly in the coroutine frame. Although, if they do not have memory recycling for the task itself like Boost.Asio does for its awaitable, it might even out.

@Tradias
Owner

Tradias commented Nov 1, 2021

I am very glad for your input and your efforts on adapting asio-grpc to libunifex. I think we should combine our efforts. I am open to pull requests, issues and ideas in general. Currently I am thinking about how to design an API for the Sender concept that works with different "backends" like Asio's set_value and libunifex's set_value.
There is a CONTRIBUTING guideline that should get you started, as well as a CMakePresets.json that your IDE might find helpful.

@npuichigo
Author

Actually, I have been following the recent proposals on how executors/networking will land in the C++ standard. I put some effort into adapting to libunifex for a more lightweight solution, since Asio is not only about executors (maybe because I'm not so familiar with Asio 😀).

Now that you have already extended asio-grpc with libunifex support, combining our efforts seems like a good way forward, and I am also willing to contribute towards an easier-to-use async API. I will take a look at the recent updates and think about it.

For the rest, since you mentioned before that we need some codegen to make things easier for users, my first thought is to add a plugin like this to provide some dummy code. Then I'd like to refer to the C# gRPC API for the design.

@npuichigo
Author

Closing as the issue is addressed now. Hope we can have further discussion in the future.
