
feat(*) Lua bridge refactor and dispatch fixes #546

Merged · 6 commits into main · Jun 11, 2024

Conversation

@thibaultcha (Member) commented May 14, 2024

Major refactor of the Lua bridge to support multiple concurrent yielding Lua threads, and refactor of dispatch call failures so that request execution continues. Replaces #539, #545, #523.

The new implementation "tricks" OpenResty by scheduling uthreads via C and passing these threads to the OpenResty runloop as if they were created from Lua (via ngx.thread). Because all uthreads must resume their "parent thread" when finished (as per OpenResty's implementation), we schedule a stub "entry thread" whenever we are trying to use the Lua bridge. This entry thread itself does nothing and is collected at request pool cleanup.
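For illustration, a rough Lua equivalent of the arrangement the C side emulates (a conceptual sketch only; the real implementation schedules these uthreads from C, and the names below are hypothetical):

    -- The stub "entry thread" is the parent that every bridge uthread
    -- resumes when it finishes; it does no work itself and is collected
    -- at request pool cleanup.
    local function lua_bridge_entry()
        -- uthreads scheduled from C, as if created via ngx.thread:
        local t1 = ngx.thread.spawn(function()
            -- e.g. a yielding resolver lookup
        end)
        local t2 = ngx.thread.spawn(function()
            -- e.g. another concurrently yielding bridge operation
        end)
        -- per OpenResty's uthread semantics, t1 and t2 resume their
        -- parent (this entry thread) upon completion
    end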

List of significant changes for this refactor:

  • Fix: cancel pending dispatches when a response is immediately produced afterwards.
  • Fix: resume the content phase after a request_body read returns EAGAIN.
  • Breaking: failing dispatch calls no longer produce HTTP 500 responses. Instead, request execution resumes normally.
  • Breaking: the proxy_wasm.start() FFI function is removed. Only proxy_wasm.attach() is now necessary, and the filter chain is only resumed once the ngx_http_wasm_module rewrite or access phases are entered. Previously, proxy_wasm.start() would resume the filter chain during the ngx_http_lua_module phase handlers, which was incompatible with Lua threads yielding (see the first sketch below this list).
  • The ngx.semaphore API can be used in the Lua bridge. The default Lua resolver now has synchronization enabled (see the second sketch below).
  • In ngx_wasm_socket_tcp, the sock->env member is now a pointer to the request's env instead of a copy, so that the env->state control variable can be manipulated.
  • The wasm_call directive can now yield, which allows for sanity testing of the Lua bridge yielding functionality.
  • A new rctx->resume_handler pointer holds the resume entry point back from yielding facilities into ngx_http_core_run_phases. For now, only the Lua bridge uses it, but other yielding facilities should be refactored to use it so as to consolidate our resuming code.
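A minimal sketch of the new attach-only flow (module path and helper names as found in the project's lua-resty-wasmx library; treat exact signatures as approximate):

    -- init_worker phase: build and load a filter chain plan
    local proxy_wasm = require "resty.wasmx.proxy_wasm"

    local plan = assert(proxy_wasm.new({
        { name = "my_filter" },  -- hypothetical filter name
    }))
    assert(proxy_wasm.load(plan))

    -- later, during a request (e.g. access_by_lua): attach only; there
    -- is no proxy_wasm.start() call anymore. The filter chain resumes
    -- once the ngx_http_wasm_module rewrite or access phase is entered.
    assert(proxy_wasm.attach(plan))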

Fix #524
Fix #528
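And a sketch of the kind of synchronization now available to the Lua bridge, using the stock ngx.semaphore API (a hypothetical critical section for illustration, not the actual resolver code):

    local semaphore = require "ngx.semaphore"

    -- binary semaphore: at most one holder at a time
    local sema = semaphore.new(1)

    local function synchronized_resolve(host)
        local ok, err = sema:wait(5)  -- yield up to 5s waiting for the lock
        if not ok then
            return nil, "failed to acquire resolver lock: " .. err
        end
        -- ... resolve `host`; concurrent bridge threads queue here ...
        sema:post(1)  -- release the lock
        return true
    end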

@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch from d3dcbf8 to 195461e on May 14, 2024 17:04

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 88.99614% with 57 lines in your changes missing coverage. Please review.

Project coverage is 90.36951%. Comparing base (277fac6) to head (b9bb644).


@@                 Coverage Diff                 @@
##                main        #546         +/-   ##
===================================================
+ Coverage   90.05848%   90.36951%   +0.31103%     
===================================================
  Files             47          47                 
  Lines          10089       10311        +222     
===================================================
+ Hits            9086        9318        +232     
+ Misses          1003         993         -10     
| Files | Coverage Δ |
|---|---|
| src/common/lua/ngx_wasm_lua_ffi.c | 92.92035% <100.00000%> (-0.67965%) ⬇️ |
| src/common/ngx_wasm_socket_tcp_readers.c | 87.81513% <100.00000%> (ø) |
| src/common/ngx_wasm_subsystem.c | 100.00000% <100.00000%> (ø) |
| src/common/proxy_wasm/ngx_proxy_wasm.c | 92.58850% <100.00000%> (+0.04952%) ⬆️ |
| src/common/proxy_wasm/ngx_proxy_wasm.h | 91.89189% <ø> (ø) |
| src/common/proxy_wasm/ngx_proxy_wasm_host.c | 94.37751% <100.00000%> (-0.02250%) ⬇️ |
| src/common/proxy_wasm/ngx_proxy_wasm_util.c | 94.14062% <ø> (ø) |
| src/http/ngx_http_wasm_util.c | 87.06199% <100.00000%> (+0.07011%) ⬆️ |
| src/http/proxy_wasm/ngx_http_proxy_wasm.c | 93.71728% <100.00000%> (+0.20377%) ⬆️ |
| src/http/proxy_wasm/ngx_http_proxy_wasm_dispatch.c | 91.62562% <100.00000%> (+0.39754%) ⬆️ |
| ... and 8 more | |

| Flag | Coverage Δ |
|---|---|
| unit | 90.13820% <88.56502%> (-0.01297%) ⬇️ |
| valgrind | 81.21406% <85.62092%> (+3.19863%) ⬆️ |

Flags with carried forward coverage won't be shown.

@hishamhm (Collaborator) commented:

> @hishamhm I would like for the datakit filter to be tested with #546 before merging it

@thibaultcha I've just tested it and it's still crashing. Running it enough times I do get segfaults, but every run consistently reports memory access errors.

These are the logs from a Gateway run with Valgrind. This particular run didn't crash, but Valgrind reported reads on data that was freed by ngx_proxy_wasm_dispatch_calls_cancel:
pr-546-kong-error.log
pr-546-kong-valgrind.log

This was running a datakit configuration with 2 parallel calls that fail (I didn't spin up the other localhost upstream).

@thibaultcha (Member, Author) commented:

@hishamhm What is the specific reproducible example for this? It does not trigger in the test shipped with this PR, and it seems to trigger when send_local_response is used at some point (as per the Valgrind report). I tried updating that test with:

    location /t {
        proxy_wasm hostcalls 'on=request_body \
                              test=/t/dispatch_http_call \
                              host=127.0.0.1:1 \
                              ncalls=2';
        proxy_wasm hostcalls 'on=request_body \
                              test=/t/echo/body';
        echo ok;
    }

But it does not trigger any memory issues.

@hishamhm (Collaborator) commented:

@thibaultcha Sure, let me try to get the same filter crash without the gateway.

@hishamhm (Collaborator) commented:

Took me a while but I got a consistent repro running without the gateway:
https://github.com/Kong/ngx_wasm_module/actions/runs/9133572211/job/25117356096

This branch/commit was intended only for sharing the test case:
495c13d
(Loading Datakit on Wasmtime with Valgrind takes a couple of minutes, so I enabled the cache in the testcase as a quality-of-life tweak for speeding up debugging runs...)

I think we can eventually isolate the offending proxy-wasm calls and produce a hostcalls-based test case, but going one step at a time, I wanted to reproduce the error on Datakit without the Gateway first.

Looking at the logs, I think the triggering condition is to dispatch a call, then in the same handler (in this case, on_request_headers), trigger a local response, before the dispatch callback gets a chance to run.

@hishamhm (Collaborator) commented:

I think I got a minimized test case. Branch tests/interrupted-dispatch on top of this PR — this test: 6cb6899

Still waiting for the CI outcome here, but it produces the Valgrind reports locally.

@thibaultcha (Member, Author) commented:

@hishamhm Thanks for the test! This turned out to be yet another thing!

@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch 10 times, most recently from 3965484 to 98c4572 on May 27, 2024 02:38
@thibaultcha (Member, Author) commented May 28, 2024

@hishamhm Would you give it another try with the current state of this branch? Except for a small failure in dynamic builds that I'm still investigating, it all runs green, so I hope it fixes everything now.

@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch from 64eb2b1 to 98c4572 on May 28, 2024 17:40
@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch 2 times, most recently from 308a043 to 587d974 on May 28, 2024 22:10
@thibaultcha (Member, Author) commented:

Ok, I fixed the last problem, but @flrgh found another one in the Kong PR, which I am also looking at now.

@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch from 587d974 to 169282f on May 29, 2024 02:52
@thibaultcha (Member, Author) commented:

@flrgh Ok, the latest state of this branch should also take care of the Gateway issue. I have the spec/02-integration/20-wasm/ suite passing locally.

@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch from 169282f to 75e1017 on May 29, 2024 14:55
@flrgh (Contributor) commented May 29, 2024

> @flrgh Ok, the latest state of this branch should also take care of the Gateway issue. I have the spec/02-integration/20-wasm/ suite passing locally.

Nice! Looking good with 75e1017 at my end too.

@hishamhm (Collaborator) commented:

I'll give this branch a spin later today!

@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch 4 times, most recently from 8954d73 to a7bb6a1 on May 30, 2024 01:48
flrgh added a commit to Kong/kong that referenced this pull request Jun 3, 2024
@thibaultcha (Member, Author) commented:

I have found one more bug that I am trying to get rid of before merging this.

@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch 4 times, most recently from b4c035a to da0bb17 on June 10, 2024 15:04
flrgh added a commit to Kong/kong that referenced this pull request Jun 10, 2024
@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch from da0bb17 to 0e39182 on June 11, 2024 02:59
Major refactor of the Lua bridge to support multiple concurrent yielding
Lua threads. The old implementation would break down when scheduling
more than one yielding Lua thread at a time.

The new implementation "tricks" OpenResty by scheduling uthreads via C
and passing these threads to the OpenResty runloop as if they were
created from Lua (via `ngx.thread`). Because all uthreads must resume
their "parent thread" when finished (as per OpenResty's implementation),
we schedule a stub "entry thread" whenever we are trying to use the Lua
bridge. This entry thread itself does nothing and is collected at
request pool cleanup.

List of significant changes for this refactor:

- **Breaking:** the `proxy_wasm.start()` FFI function is **removed**.
  Only `proxy_wasm.attach()` is now necessary, and the filter chain is
  only resumed once the ngx_http_wasm_module `rewrite` or `access`
phases are entered. Previously, `proxy_wasm.start()` would resume the
  filter chain during the ngx_http_lua_module phase handlers, which was
  incompatible with Lua threads yielding.
- The `ngx.semaphore` API can be used in the Lua bridge. The default Lua
  resolver now has synchronization enabled.
- In ngx_wasm_socket_tcp, the `sock->env` member is now a pointer to the
  request's `env` instead of a copy so as to manipulate the `env->state`
  control variable.
- The `wasm_call` directive can now yield, which allows for sanity
  testing of the Lua bridge yielding functionality.
- A new `rctx->resume_handler` pointer holds the resume entry point back
  from yielding facilities into `ngx_http_core_run_phases`. For now,
  only the Lua bridge uses it, but other yielding facilities should be
  refactored to use it so as to consolidate our resuming code.

Fix #524

Originally this fixes an issue described in #545, in which a failing
dispatch call may cause a pending call to segfault when it is resumed.

To fix this issue, this commit updates how dispatch calls are treated
and makes the execution continue when they fail, including the Lua
bridge resolver (very tricky).

Delete the posted event for dispatches that were sent in the latest
Proxy-Wasm step.

The invocation of `coctx->cleanup` previously introduced allows removing
this custom handling of pending sleep timers added to the Lua bridge.
Kept as a commit for future reference.
@thibaultcha force-pushed the refactor/lua-bridge-and-dispatch-fix branch from 0e39182 to b9bb644 on June 11, 2024 05:03
@thibaultcha merged commit 3dbf52a into main on Jun 11, 2024
35 checks passed
@thibaultcha deleted the refactor/lua-bridge-and-dispatch-fix branch on June 11, 2024 15:10
@thibaultcha (Member, Author) commented:

@flrgh @hishamhm Merged! In time for the Gateway patch release, I hope.
