Performance of triggering UDF execution engine #43

fhoering · 2024-02-29T08:38:53Z

The specification has some explanations here on how JS and WASM workloads would be handled by the UDF execution engine:
https://github.com/privacysandbox/protected-auction-services-docs/blob/main/bidding_auction_services_system_design.md#adtech-code-execution-engine
https://github.com/privacysandbox/data-plane-shared-libraries/tree/main/scp/cc/roma

This design looks interesting and I'm trying to find out if it would be able to handle workloads with thousands of QPS per instance and 10ms latency. I'm wondering in particular how it would work with managed languages like c# or Java compiled to WASM.

From what it understand there will be N pre allocated workers each able to handle single threaded workloads.

The doc mentions this part about JS

Javascript code is preloaded, cached, interpreted and a snapshot is created. A new context is created from the snapshot before execution in a Roma worker.

What does recreating the context exactly imply in terms of performance ?

The WASM module is preloaded and cached in Roma in the non request path. WASM is parsed for every execution within a Roma worker.

My understanding is that compiling c#/Java to WASM works like a self contained executable which means the runtime needs to be embedded inside the WASM file. If the runtime and garbage collector would be initialized all the time for each request the overhead is very probably prohibitive for workloads mentioned above.

Can you provide more information on how JS and WASM (java, c#) workloads would be handled exactly with the UDF execution and if this could handle the workloads mentioned above ?

lx3-g · 2024-03-27T12:35:45Z

Hi Fabian Höring,

Thnx for the question and sorry for the delay. We're currently working on getting back to you,

Thank you,
Alexander

peiwenhu · 2024-04-08T18:08:34Z

Hi sorry for the long delay.

Could you please elaborate on the workload you have in mind? The throughput and latency performance depends greatly on the workload. For example what’s the context and use case and what’s the main operations?
SetCodeObject is called before ExecuteCode so the Snapshot would be created. This should minimize the latency for context creation. We verified this using benchmarks. Context has to be created for every execution to maintain per-execution isolation.
The latency varies depending on the UDF code. If you are curious about the numbers for your code, they should be easy to using benchmark setup we have.
Since the setup is in a dependency, that repository would need to be pulled in.
This can be done by cloning and checking out the commit we use. Example -
git clone https://github.com/privacysandbox/data-plane-shared-libraries.git
cd data-plane-shared-libraries
git checkout 89e8cf07e233779e92915fa6fbcd854f648e327c
Or downloading the zip file for the version we use at HEAD.
To run the benchmarks implemented, the following command can be used -
scripts/run-benchmarks --target //src/roma/benchmark:kv_server_udf_benchmark_test --benchmark_time_unit ms

Sample output -

Benchmark	Time	CPU	Iterations	UserCounters...
BM_LoadHelloWorld/0	11.2 ms	0.047 ms	1000	bytes_per_second=596.362Ki/s
BM_LoadHelloWorld/128	11.1 ms	0.043 ms	1000	bytes_per_second=3.60729Mi/s
BM_LoadHelloWorld/512	11.2 ms	0.048 ms	1000	bytes_per_second=10.9137Mi/s
BM_LoadHelloWorld/1024	11.2 ms	0.045 ms	1000	bytes_per_second=22.4478Mi/s
BM_LoadHelloWorld/10000	11.4 ms	0.074 ms	1000	bytes_per_second=129.585Mi/s
BM_LoadHelloWorld/20000	11.5 ms	0.085 ms	1000	bytes_per_second=223.786Mi/s
BM_LoadHelloWorld/50000	11.9 ms	0.147 ms	1000	bytes_per_second=325.068Mi/s
BM_LoadHelloWorld/100000	12.7 ms	0.274 ms	1000	bytes_per_second=348.233Mi/s
BM_LoadHelloWorld/200000	14.1 ms	0.497 ms	1374	bytes_per_second=384.041Mi/s
BM_LoadHelloWorld/500000	18.8 ms	1.26 ms	493	bytes_per_second=378.279Mi/s
BM_ExecuteHelloWorld	0.905 ms	0.027 ms	10000	items_per_second=37.3452k/s
BM_ExecuteHelloWorldCallback	0.982 ms	0.028 ms	10000	items_per_second=35.8002k/s

We would like to help you achieve an accurate measurement. If you are interested in collaborating to measure more please feel free to let us know what you think we may be able to help with.

fhoering · 2024-04-09T12:08:51Z

Hello @peiwenhu,

Thanks for the information. I will come back to you with more information about the workloads.

About compîling data-plane-shared-libraries locally.
Sorry for this question but I never used Google specific tooling like bazel before.

If I do this:

cd data-plane-shared-libraries
git checkout 89e8cf07e233779e92915fa6fbcd854f648e327c

What command do I need to execute to compile this ?
How do I need to change the workspace file to actually pull my local sources ?

a-shruti · 2024-04-09T14:14:37Z

Hello!

What command do I need to execute to compile this ?

For running benchmarks, you can simply use the specified command -

scripts/run-benchmarks

(For running your benchmarks, you will need to modify kv_server_udf_benchmark_test and include the code you want to benchmark.)

If you run at HEAD for data-plane-shared-libraries, the following command can be used -

scripts/run-benchmarks --target //src/roma/benchmark:kv_server_udf_benchmark_test --benchmark_time_unit ms

How do I need to change the workspace file to actually pull my local sources

For running benchmarks, I don't think you have to modify your K/V server workspace. However, in general local_repository bazel rule can be used.

Let us know if any more information is needed from our side.

Thanks!

fhoering · 2024-04-15T07:42:12Z

Thanks. I was able to execute the benchmarks like this:

./builders/tools/bazel-debian run //scp/cc/roma/benchmark/test:kv_server_udf_benchmark_test -- --benchmark_out=/src/workspace/dist/benchmarks/kv_server_udf_benchmark.json --benchmark_out_format=json --benchmark_time_unit=ms
./builders/tools/bazel-debian run //scp/cc/roma/benchmark/test:benchmark_suite_test -- --benchmark_out=/src/workspace/dist/benchmarks/benchmark_suite_test.json --benchmark_out_format=json --benchmark_time_unit=ms

I get the same results of ~1ms for executing an empty JS function.

I also ran the multi threaded roma workloads which give results of ~1000 request per sec.

An additional question on that. I never succeeded to get dropped requests even with a worker size of 1 and a queue size of 1. Is this expected ?

test_configuration.workers = 1;
    test_configuration.inputs_type = InputsType::kSimpleString;
    test_configuration.input_payload_in_byte = 500000;
    test_configuration.queue_size = 1;
    test_configuration.batch_size = 1;
    test_configuration.request_threads = 30;
    test_configuration.requests_per_thread = 1000;

Those tests seem to be done from the same machine which means the client can impact the server and vice versa. We probably have to run the full server and run the client with some web load injector like gatling to get representative results.

a-shruti · 2024-04-16T15:23:22Z

An additional question on that. I never succeeded to get dropped requests even with a worker size of 1 and a queue size of 1. Is this expected ?

In these benchmarks, for every request, we wait for the response individually. Hence, this behaviour is expected.

* Release 0.11.0 (2023-07-11) (privacysandbox#40) ### Features * [Breaking change] Use UserDefinedFunctionsConfig instead of KVs for loading UDFs. * [Sharding] Add hpke for s2s communication * [Sharding] Allow for partial data lookups * [Sharding] Making downstream requests in parallel * Add bazel build flag --announce_rc * Add bool parameter to allow routing V1 requests through V2. * Add buf format pre-commit hook * Add build time directive for reentrant parser. * Add functions to retrieve instance information. * Add internal run query client and server. * Add JS hook for set query. * Add lookup client and server for communication with shards * Add MessageQueue for the request simulation system * Add query grammar and interface for set queries. * Add rate limiter for the request simulation system * Add second map to store key value set and add set value update interfaces * Add shard metadata for supporting sharded files * Add simple microbenchmarks for key value cache * Add UDF support for format data command. * Add unit tests for query lexer. * Adding cluster mappings manager * Adding padding * Apply custom lockings on the cache * Connect InternalRunQuery to the parser * Extend and simplify collect-logs to capture test outputs * Extend use of scp deps via data-plane-shared repo * Implement shard manager * Move sharding function to public so it's available for file sharding * Register a logging hook with the UDF. * Register run query hook with udf framework. * Sharding - realtime updates * Sharding read flow fixes * Simplify work done in set operations. Set operations can be passed by * Snapshot files support UDF configs. * Support reading and writing set queries to data files. * Support reading and writing set values for csv files * Support reading/writing DataRecords. Requires new DELTA format. * Support writing sharded files * Update data_loading.fb to support UDF code updates. * Update pre-commit hook versions * Update shard manager mappings continuously * Upgrade build-system to release-0.28.0 * Upgrade build-system to v0.30.1 * Upgrade scp to 0.72.0 * Use Unix domain socket for internal lookup server. * Utilize AWS deps via data-plane-shared repo ### Bug Fixes * Add internal lookup client deadline. * Catch error if insufficient args specified * Fix aggregation logic for set values. * Fix ASAN potential deadlock errors in key_value_cache_test * Proper memory management of callback hook wrappers. * Specify 2 workers for UDF execution. * Upgrade pre-commit hooks * Use shared pointer for UDF absl::Notification. ### Build Tools: Fixes * **build:** Add scope-based sections in release notes ### Documentation * Add docs for data loading capabilities * Add explanation that access control is managed by IAM for writes. * Point readme to a new sharding public explainer Bug: 290798418 Change-Id: I691da695f5727a8517ed3e9f18a3a2d8c5b9e0bf GitOrigin-RevId: 5958051464911b6da60c38bc2a83c3451adadf42 Co-authored-by: Privacy Sandbox Team <no-reply@google.com> * Release 0.11.1 (2023-08-02) (privacysandbox#42) Bug: b/293901782 Change-Id: I4487c821883756599b3d66bb7774cc6585e653dc GitOrigin-RevId: e8cd94de2dbb081a724e9700c98fc61bbd511687 Co-authored-by: Privacy Sandbox Team <no-reply@google.com> --------- Co-authored-by: Peiwen Hu <peiwenhu@google.com> Co-authored-by: Privacy Sandbox Team <no-reply@google.com>

fhoering · 2024-07-09T15:57:09Z

@peiwenhu @lx3-g @a-shruti

We have done several web load tests with Gatling (ramp-up then constant load to 100k QPS over several steps and 120 secs each).

Performance test setup (c# vs JS UDF execution)

We deployed KV server version 0.16 on our own infrastructure in a container on 1 instance with 8 cores (16 threads) & 16 GB of memory with the following components:

Envoy proxy
ROMA JS UDF engine with 16 workers
In memory lookup

Javascript UDF code:

no batching (broken in version 0.16)
it reads the keys from the request;
it queries the in-memory datastore to get the values of those keys;
it replies with those values.

We then implemented the same basic logic in a c# vanilla asp.net server and executed the same web load test for comparison.

Web assembly test setup

We have deployed and benchmarked the provided c++ => WASM sample file from here (file size: 100 KB)

We also tried out the provided Microsoft templates to compile c# to WASM (dotnet 9 required, file size: 30 MB, contains c# runtime)

dotnet workload install wasi-experimental
dotnet new wasiconsole -o Ans43
dotnet build -c Release

Results

Conclusion

Google KV JS implementation can handle 1000s of queries per second with ms second latency
an empty c#/asp.net service can handle 10 times more queries (45k QPS vs 5k QPS) with better mean response times (3ms vs 6ms)
c++ WASM can handle 100 QPS, there is a clear down lift in performance, seems prohibitive already
c# WASM can handle 1 QPS, it replies but can be considered as not working
WASM in general doesn't seem like a real alternative, so the effective mandatory UDF backend seems to be Javascript

This was referenced Jul 10, 2024

Inline large data structures for caching #62

Open

Proposal for execution engine allowing runtime languages (like c#, Java) #66

Open

fhoering mentioned this issue Sep 6, 2024

Agenda items for Bidding&Auction Service - TPAC 2024 WICG/protected-auction-services-discussion#85

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance of triggering UDF execution engine #43

Performance of triggering UDF execution engine #43

fhoering commented Feb 29, 2024

lx3-g commented Mar 27, 2024

peiwenhu commented Apr 8, 2024 •

edited by pmeric

Loading

fhoering commented Apr 9, 2024

a-shruti commented Apr 9, 2024 •

edited

Loading

fhoering commented Apr 15, 2024

a-shruti commented Apr 16, 2024

fhoering commented Jul 9, 2024 •

edited

Loading

Performance of triggering UDF execution engine #43

Performance of triggering UDF execution engine #43

Comments

fhoering commented Feb 29, 2024

lx3-g commented Mar 27, 2024

peiwenhu commented Apr 8, 2024 • edited by pmeric Loading

fhoering commented Apr 9, 2024

a-shruti commented Apr 9, 2024 • edited Loading

fhoering commented Apr 15, 2024

a-shruti commented Apr 16, 2024

fhoering commented Jul 9, 2024 • edited Loading

Performance test setup (c# vs JS UDF execution)

Web assembly test setup

Results

Conclusion

peiwenhu commented Apr 8, 2024 •

edited by pmeric

Loading

a-shruti commented Apr 9, 2024 •

edited

Loading

fhoering commented Jul 9, 2024 •

edited

Loading