Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(runloop) execute plugins on Nginx-produced errors #3533

Merged
merged 1 commit into from
Jun 15, 2018

Conversation

thibaultcha
Copy link
Member

@thibaultcha thibaultcha commented Jun 12, 2018

Context

Before this patch, several scenarios would make Kong/Nginx produce
errors that did not execute plugins:

  1. Nginx-produced client errors (HTTP 4xx) on invalid requests per the
    Nginx settings (e.g. headers too large, URI too large, etc...)
  2. Proxy TCP/HTTP errors on the last balancer try (e.g. connection
    refused, connection timeout, read/write timeout, etc...)
  3. Any other Nginx-produced server errors (HTTP 5xx), if any

Because plugins are not executed in such cases, logging plugins in
particular do not get a chance to report such errors, leaving Kong
administrator blind when relying on bundled or homegrown logging
plugins. Additionally, other plugins, e.g. transformation, do not get a
chance to run either and could lead to scenarios where those errors
produce headers or response bodies that are undesired by the
administrator.

This issue is similar to that fixed by #3079, which handled the
following case:

  1. Requests short-circuited by Kong itself (vs Nginx in 1, 2, and 3.)

See also #490 & #892 for issues related to 4. This patches addresses 1,
2, and 3.

Problem

Nginx-produced errors (4xx and 5xx) are redirected to the
/kong_error_handler location by way of the error_page directive,
but the runloop is not given a chance to run, since the error handler
is only instructed to override the Nginx-defined HTML responses, and
produces Kong-friendly ones.

In short, a patch solving this issue must enable the runloop within the
error handler.

Proposed solution

Several (many) possibilities were evaluated and explored to fix this.
The proposed patch is the result of many attempts, observations, and
adjustments to initial proposals. In a gist, it proposes:

  • Executing the response part of the runloop in the error_page location.
  • Ensure that if the request part of the runloop did not have a chance
    to run, we do build the list of plugins to execute (thus, global plugins
    only since the request was not processed).
  • Ensure that if the request part of the runloop did run, we preserve
    its ngx.ctx values.

A few catches:

  • Currently, ngx.req.get_method() in the error_page location block
    will return GET even if the request method was different (only
    HEAD is preserved). A follow up patch for Nginx is being prepared.
    As such, several tests are currently marked as "Pending".
  • When the rewrite phase did not get a chance to run, some parts of
    the request may not have been parsed by Nginx (e.g. on HTTP 414), thus
    some variables, such as $request_uri or request headers may not have
    been parsed. Code should be written defensively against such
    possibilities, and logging plugins may report incomplete request data
    (but will still report the produced error).
  • The uninitialized_variable_warn directive must be disabled in the
    error_page location block for cases when the rewrite phase did not
    get a chance to run (Nginx-produced 4xx clients errors).
  • The ngx.ctx variables must be copied by reference for plugins to be
    able to access it in the response phases when the request was
    short-circuited by Nginx (e.g. HTTP 502/504 from the upstream module).
  • Using a named location for the error_page handler is not possible,
    since in some cases the URI might not be parsed by Nginx, and thus Nginx
    refuses to call named locations in such cases (e.g. on HTTP 414).

Resulting behavior

The 03-plugins_triggering_spec.lua test suite is now a good reference
for the behavior observed upon Nginx-produced errors. This behavior can
be classified in 2 categories:

  1. When the request is not parsed by Nginx (HTTP 4xx)
    a. The error_page directive is called (rewrite and access are
    skipped)
    b. The runloop retrieves the list of global plugins to run (it
    cannot retrieve more specific plugins since the request could not be
    parsed)
    c. The error content handler overrides the Nginx body with a Kong-friendly
    one (unchanged from previous behavior)
    d. It executes the header_filter phase of each global plugin.
    e. It executes the body_filter phase of each global plugin.
    f. It executes the log phase of each global plugin.

  2. When the error was produced during upstream module execution (HTTP 5xx)
    a. The rewrite and access phases ran normally (we stashed the
    ngx.ctx reference)
    b. The error_page directive is called
    c. The error content handler applies the original ngx.ctx by
    reference
    d. It does not run the plugins loop, since ngx.ctx contains
    the list of plugins to run (global and specific plugins in
    this case)
    e. The error content handler overrides the Nginx body with a Kong-friendly
    one (unchanged from previous behavior)
    f. It executes the header_filter phase of each global plugin.
    g. It executes the body_filter phase of each global plugin.
    h. It executes the log phase of each global plugin.

Fix #3193

@@ -147,11 +148,25 @@ server {
}
}

location = /kong_error_handler {
location /kong_error_handler {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to keep the =

@thibaultcha thibaultcha force-pushed the feat/plugins-on-nginx-errors branch from 56c8d9c to 2f78f51 Compare June 14, 2018 00:36
hishamhm added a commit that referenced this pull request Jun 14, 2018
Report both connection timeouts at the `balancer` phase and
upstream timeouts as `timeout` events when using passive healthchecks.

Previously, connection timeouts were being misreported as `tcp_failures`.
Upstream timeouts can now be caught due to recent runloop improvements for
executing plugins on Nginx-produced errors (#3533).

Includes regression tests for both cases.
@thibaultcha thibaultcha force-pushed the feat/plugins-on-nginx-errors branch from 2f78f51 to 774d326 Compare June 15, 2018 01:26
Context
-------

Before this patch, several scenarios would make Kong/Nginx produce
errors that did not execute plugins:

1. Nginx-produced client errors (HTTP 4xx) on invalid requests per the
   Nginx settings (e.g. headers too large, URI too large, etc...)
2. Proxy TCP/HTTP errors on the last balancer try (e.g. connection
   refused, connection timeout, read/write timeout, etc...)
3. Any other Nginx-produced server errors (HTTP 5xx), if any

Because plugins are not executed in such cases, logging plugins in
particular do not get a chance to report such errors, leaving Kong
administrator blind when relying on bundled or homegrown logging
plugins. Additionally, other plugins, e.g. transformation, do not get a
chance to run either and could lead to scenarios where those errors
produce headers or response bodies that are undesired by the
administrator.

This issue is similar to that fixed by #3079, which handled the
following case:

4. Requests short-circuited by Kong itself (vs Nginx in 1, 2, and 3.)

See also #490 & #892 for issues related to 4. This patches addresses 1,
2, and 3.

Problem
-------

Nginx-produced errors (4xx and 5xx) are redirected to the
`/kong_error_handler` location by way of the `error_page` directive,
but the runloop is not given a chance to run, since the error handler
is only instructed to override the Nginx-defined HTML responses, and
produces Kong-friendly ones.

In short, a patch solving this issue must enable the runloop within the
error handler.

Proposed solution
-----------------

Several (many) possibilities were evaluated and explored to fix this.
The proposed patch is the result of many attempts, observations, and
adjustments to initial proposals. In a gist, it proposes:

* Executing the response part of the runloop in the error_page location.
* Ensure that if the request part of the runloop did not have a chance
  to run, we do build the list of plugins to execute (thus, global plugins
  only since the request was not processed).
* Ensure that if the request part of the runloop did run, we preserve
  its `ngx.ctx` values.

A few catches:

* Currently, `ngx.req.get_method()` in the `error_page` location block
  will return `GET` even if the request method was different (only
  `HEAD` is preserved). A follow up patch for Nginx is being prepared.
  As such, several tests are currently marked as "Pending".
* When the `rewrite` phase did not get a chance to run, some parts of
  the request may not have been parsed by Nginx (e.g. on HTTP 414), thus
  some variables, such as `$request_uri` or request headers may not have
  been parsed. Code should be written defensively against such
  possibilities, and logging plugins may report incomplete request data
  (but will still report the produced error).
* The `uninitialized_variable_warn` directive must be disabled in the
  `error_page` location block for cases when the `rewrite` phase did not
  get a chance to run (Nginx-produced 4xx clients errors).
* The `ngx.ctx` variables must be copied by reference for plugins to be
  able to access it in the response phases when the request was
  short-circuited by Nginx (e.g. HTTP 502/504 from the upstream module).
* Using a named location for the `error_page` handler is not possible,
  since in some cases the URI might not be parsed by Nginx, and thus Nginx
  refuses to call named locations in such cases (e.g. on HTTP 414).

Resulting behavior
------------------

The `03-plugins_triggering_spec.lua` test suite is now a good reference
for the behavior observed upon Nginx-produced errors. This behavior can
be classified in 2 categories:

1. When the request is not parsed by Nginx (HTTP 4xx)
    a. The `error_page` directive is called (`rewrite` and `access` are
       skipped)
    b. The runloop retrieves the list of **global** plugins to run (it
       cannot retrieve more specific plugins since the request could not be
       parsed)
    c. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    d. It executes the `header_filter` phase of each global plugin.
    e. It executes the `body_filter` phase of each global plugin.
    f. It executes the `log` phase of each global plugin.

2. When the error was produced during upstream module execution (HTTP 5xx)
    a. The `rewrite` and `access` phases ran normally (we stashed the
       `ngx.ctx` reference)
    b. The `error_page` directive is called
    c. The error content handler applies the original `ngx.ctx` by
       reference
    d. It does **not** run the plugins loop, since `ngx.ctx` contains
       the list of plugins to run (global **and** specific plugins in
       this case)
    e. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    f. It executes the `header_filter` phase of each global plugin.
    g. It executes the `body_filter` phase of each global plugin.
    h. It executes the `log` phase of each global plugin.

Fix #3193
@thibaultcha thibaultcha force-pushed the feat/plugins-on-nginx-errors branch from 774d326 to 3dc04a3 Compare June 15, 2018 02:14
@thibaultcha thibaultcha merged commit e709c95 into next Jun 15, 2018
thibaultcha pushed a commit that referenced this pull request Jun 15, 2018
Report both connection timeouts at the `balancer` phase and
upstream timeouts as `timeout` events when using passive healthchecks.

Previously, connection timeouts were being misreported as
`tcp_failures`.  Upstream timeouts can now be caught due to recent
runloop improvements for executing plugins on Nginx-produced errors
(#3533).

Includes regression tests for both cases.

From #3539
thibaultcha pushed a commit that referenced this pull request Jun 15, 2018
Report both connection timeouts at the `balancer` phase and
upstream timeouts as `timeout` events when using passive healthchecks.

Previously, connection timeouts were being misreported as
`tcp_failures`.  Upstream timeouts can now be caught due to recent
runloop improvements for executing plugins on Nginx-produced errors
(#3533).

Includes regression tests for both cases.

From #3539
@thibaultcha thibaultcha deleted the feat/plugins-on-nginx-errors branch June 19, 2018 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant