Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kong plugins cannot log read and connection timeouts #3193

Closed
jeremyjpj0916 opened this issue Feb 2, 2018 · 34 comments
Closed

Kong plugins cannot log read and connection timeouts #3193

jeremyjpj0916 opened this issue Feb 2, 2018 · 34 comments
Labels
task/feature Requests for new features in Kong

Comments

@jeremyjpj0916
Copy link
Contributor

jeremyjpj0916 commented Feb 2, 2018

Summary

Has been discussed in the forum but since its a bug figured it would be good to make a git issue since that is easier to track long term:
https://discuss.konghq.com/t/kong-http-log-plugin-errors/445/10

Basically when it comes to read and connection timeouts Kong can't invoke any of the logging plugins on transactions.

Steps To Reproduce

  1. Enable any of the logging plugins globally/on an api/on a consumer

  2. Create an API, set the upstream read timeout property to 1ms or something crazy to force the timeout.

  3. Watch as you will see in proxy error logs the read timeout with nginx, but no plugins at the Kong layer can trap this problem currently, thibaultcha informed us:

"
kong.log() call will not execute plugins indeed, for another reason: NGINX’s error_page does an internal redirect, which causes ngx.ctx (the value storing the API and plugins to execute for this request) to be reset. Thus, no plugins will be run (Kong even lost the information relating to which API was matched in that particular request at this point).

There seems to be no easy fix for this, we’ll have to add this to our backlog and think of a workaround.
"

Additional Details & Logs

  • Kong version - All versions are impacted, we specifically are running Kong 0.12.1
  • Operating System -Alpine 3.6(Kong's docker image).

Thanks,
-Jeremy

@thibaultcha thibaultcha added the task/feature Requests for new features in Kong label Feb 5, 2018
@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented Mar 16, 2018

Yeah seeing some more of these kinda errors in nginx logs, will be super helpful when Kong figures out a way to expose these through plugins so we can index them into splunk/kibana or any other kinda external record keepers for transactions:

2018/03/16 07:13:32 [error] 40#0: *4952827 upstream prematurely closed connection while reading response header from upstream, client: 10.130.2.1, server: kong, request: "GET /test/cat/v1?Name=Test HTTP/1.0", upstream: "https://10.175.228.159:443/rest/api/test/cat/v1?Name=Test", host: "gateway-dev.company.com"

@deepshrift
Copy link

Look forward to this feature,it's important,upstream timeout should be recorded.

@jeremyjpj0916
Copy link
Contributor Author

Agreed, when working with APIs there is always opportunity for these issues to occur and currently the logging plugins are blind to these scenarios. In our use of testing around with Kong for 8+ months I have found this problem to be one of the most troublesome with no easy workaround. I don't want to tail my stdout(or the nginx log files if you leave it there by default) and write some wonky code to REST transact it to our http sinks. Hoping Kong or OpenResty come up with an elegant solution.

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented Apr 20, 2018

Wondering if this is still on the roadmap or any progress being made here. I am up for testing in a dev environment if anyone has any crazy ideas.

@deepshrift
Copy link

I am also concerned about the progress of this issue. We are using kong as an light and high performance esb,but the timeout requet could not be recored now,some important info was lost. Hoping for your reply,Thanks a lot. @jeremyjpj0916 @thibaultcha

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented Apr 22, 2018

@deepshrift slightly off topic, but as an ESB are you using Kong as a Gateway for SOAP services as well? Where you implement SOAP WS Security at the gateway layer via a Kong plugin? That is a feature I have wanted for awhile since many companies have deprecated SOAP services and claim they are to expensive to migrate off of. If you have anything opensource plugins of that style would love to check it out.

EDIT-
On this note, i did go back and test some SOAP services with Kong. They work fine and you can wrap them with Oauth2 client creds or JWT Auth and the logging plugins still work and such. not a bad pattern really, just not the defacto ws-security most employ with SOAP.

@jeremyjpj0916
Copy link
Contributor Author

Adding new notes here(taken from forum post), this should be the solution Kong goes with (probably should live in CORE upon my initial guess):

"
What do you think about this possibly yo : openresty/lua-nginx-module#1057 2 , you think this to be a valid way to do it in Kong and if we did where should this live? Could this be done in the template? Seems agentzh has known about this for awhile:
“The problem is that the nginx core clears all modules’ ctx data upon internal redirects.”

EDIT - Ohhh 3scale api gateway who uses nginx similarly to Kong also wrote a module for it(you can see it tied to the issue I linked, Makes sense I think that this would integrate into Kong core rather than a plugin/template I would think (it is Apache 2.0 Licensed) -
https://github.com/3scale/apicast/blob/master/gateway/src/resty/ctx.lua
And Testcase
https://github.com/3scale/apicast/blob/master/t/resty-ctx.t
"

@thibaultcha
Copy link
Member

@jeremyjpj0916 Thanks for helping in investigating our options here. As we discussed in the past, copying ngx.ctx is one of the pieces of the puzzle, but it isn't the only one (not is it necessary the only solution).

As of today, the reason why ngx.ctx has to be be copied is because we need the result of the router and the list of plugins to run (which depends on the router to have executed first, for obvious reasons), and copying ngx.ctx is the easiest "hack" we can come up with to do so. We could also copy individual values from it, or depend on a different runloop that does not require the list of plugins to be reside in ngx.ctx (e.g. a list that can be given as an argument, transmitted via a subrequest, shm, Nginx variable, etc...). Anyway, let's go down the ngx.ctx path for now.

As of today, copying ngx.ctx and adding a logging phase to the error handling locations will make logging plugins catch HTTP 502 and 504 errors (i.e. L4 failures such as connection refusals or timeouts). I tested this in a WIP branch and it works fine, as expected. Even came up with a few integration tests for it.

However, more work is required to catch HTTP 4xx errors produced by Nginx (e.g. "414 Request URI Too Large"). That is because when Nginx responds with one of these status codes, it does so before the router (i.e. Kong's proxy), is given a chance to run at all. In order to catch these, we would need to run a dummy router and construct the plugins runloop from scratch from the error handler's log phase. One could think that if the latter is done, then copying the ngx.ctx table becomes unnecessary, but not necessarily, since in proxy error cases, it will contain relevant values for logging plugins to use such as for example the number of retries by the balancer, and the intermediate errors encountered by such tries.

Another question we need to answer is whether or not such responses produced by the error handlers should trigger the header_filter and body_filter phases, In my opinion, they should, since it is already the case for request that are short-circuited by other plugins in the access phase (e.g. authentication plugins). There is no guarantee that header_filter and body_filter phases are executed only for upstream responses from proxied request, and plugins should be programmed defensively against such cases (which gives them great flexibility and power, as usual). In the future, we will revisit the plugins interface and can make more intuitive design decisions and maybe separate such handlers, but that is another topic.

Anyway, I have pushed a WIP branch with the suggested ngx.ctx copying here if you are interested. As mentioned, it only triggers logging plugins for HTTP 502 and 504 errors.

@jeremyjpj0916
Copy link
Contributor Author

This is certainly a step in the right direction and really appreciate you looking into this on a Sunday! Capturing L4 errors such as 502/504 is a fairly big game changer on the logging front IMO since many applications and networks under duress will throw those fairly frequently I have been finding. 414 and some of the other obscure errors are definitely a new topic of consideration but I am fairly convinced this little WIP will solve 90% of the logging woes Kongers face (or probably don't know they would face if their apps/network are just always flawless hah) . I will be digesting what you have done here and will test as well and report back. Thanks again!

@thibaultcha
Copy link
Member

@jeremyjpj0916 Realized I didn't push the branch before I left the laptop earlier. Done now.

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented May 14, 2018

@thibaultcha 👍 thanks, at first I looked at it for 10 minutes earlier this afternoon thinking I must be doing something wrong and just was not familiar enough with comparing branches lol. Seeing as I generally pull down the Kong tarball in my image build I am hoping I can just overwrite the default files prior to the Kong prepare like-

cp /yourbranch/init.lua kong/default/location/init.lua

And ofc apply the template changes. If it will take anything more such as recompiling nginx or openresty I may need a line or two snippit that do that but I am hoping its just as easy as when I overwrite the default Kong plugin code with a modified lua files and just drop it right into their default locations prior to startup(probably not best practice considering yall may change those but so far can't say it has bitten me :) ).

Do you think you will roll this stuff out incrementally into the community edition and subsequently enterprise as well with just these basic l4 errors being handled logged first, or do you intend to wait until you can mature the logic a bit further to cover all possible situations this new code is applicable too(such as 4xx's)?

Regardless I am reading over the code and its starting to make sense to me how you integrated it, impressive and exciting times for Kong!

@thibaultcha
Copy link
Member

@jeremyjpj0916 I am of the opinion that if we can offer incremental improvements, we should. So handling 502/504 will be better than not. That said, it will have to go into a major release at least because of the changes to the Nginx template. Future work to handle other 4xx responses doesn’t seem like it would be breaking once we have the Nginx template changes in place.

So my take would be 0.14. We’ll schedule this in one of our next sprints.

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented May 14, 2018

@thibaultcha I had a chance to test the code today. Does not seem to be doing what your testing resulted in. My testing of it resulted in all plugins stopped functioning all together that were enabled globally or on routes/services/api's . I was able to call oauth2/jwt protected API's as if they had no plugins enabled at all, statsD stopped logging. HTTP log plugin stopped logging. No plugins seem to be running and I did a diff between this branches changes vs our existing code and cant see any difference. To give you a taste of how I attempted to recreate your code I did the following, this is needed bit from our Dockerfile:

FROM docker.company.com/company_et/alpine:3.6

ENV KONG_VERSION 0.13.1
ENV KONG_SHA256 caf119ce7a367b9d9cd1ace55c4b6a499ff3c4693f80ea25129ebf2da0373fcc

USER root
#Add apache2-utils to second apk add line 9 if you ever want to test ab(apache benchmark) for localhost testing to bypass routers.
RUN apk add --no-cache --virtual .build-deps wget tar ca-certificates \
	&& apk add --no-cache libgcc openssl pcre perl unzip tzdata luarocks git \
	&& wget -O kong.tar.gz "https://bintray.com/kong/kong-community-edition-alpine-tar/download_file?file_path=kong-community-edition-$KONG_VERSION.apk.tar.gz" \
	&& echo "$KONG_SHA256 *kong.tar.gz" | sha256sum -c - \
	&& tar -xzf kong.tar.gz -C /tmp \
	&& rm -f kong.tar.gz \
	&& cp -R /tmp/usr / \
	&& rm -rf /tmp/usr \
	&& cp -R /tmp/etc / \
	&& rm -rf /tmp/etc \
	&& apk del .build-deps

RUN mkdir /usr/local/kong
RUN chmod -R 777 /usr/local/kong

COPY templates/custom_nginx.template /usr/local/kong/custom_nginx.template
RUN chmod 777 /usr/local/kong/custom_nginx.template

#New Changes to test enhanced logging, should be in Kong 0.14 release later
COPY patchtests/kong/init.lua /usr/local/share/lua/5.1/kong/init.lua
RUN chmod 777 /usr/local/share/lua/5.1/kong/init.lua

COPY patchtests/kong/resty/ctx.lua /usr/local/share/lua/5.1/kong/resty/ctx.lua
RUN chmod 777 /usr/local/share/lua/5.1/kong/resty/ctx.lua

Other changes were changing runloop to core in the init.lua (your tarball/deployed kong does not have a runloop, seems yall differ in that respect from packaged source to the repo source)

kongModsAndTemplate.zip

Added a zip of our custom template as well as the init.lua and ctx.lua(which is identical to the one sitting in repo).

Is there some other dependency I should be aware of when attempting to deploy these changes with the "release" codebase? Based on my Docker sample above and nginx custom template I would think I seemingly covered the bases. Of course I could wait till .14 comes out but I was going to give you confirmation it works on my end if you wanted further testing and performance analysis done with it off/on to see if any impact.

EDIT - Also logs do not reveal any lua runtime errors. Build's fine. --vv verbose mode revealed nothing either as well.

@thibaultcha
Copy link
Member

@jeremyjpj0916 I can make this as a patch on top of 0.13.1 if you want. FYI the unit and integration tests do not seem to have any worrying failures, so something probably went wrong in your environment because of discrepancies between the 0.13.1 and master branches.

@jeremyjpj0916
Copy link
Contributor Author

Yeah but I can't imagine 0.13.1 vs master have any breaking changes in the init.lua? You did intend to have this new resty ctx module live in the : /usr/local/share/lua/5.1/kong/resty/ctx.lua location correct?

Tis a head scratcher because not that much really changed. Will keep poking/prodding it. Ideally if it was on a personal bintray alpine tarball then its identical to how I have been building. Stumped until further notice, will update here if I hit a break through of seeing where the problem lies otherwise can wait a till .14 .

@thibaultcha
Copy link
Member

Yeah but I can't imagine 0.13.1 vs master have any breaking changes in the init.lua?

It does actually: f20432f.

I just pushed a tmp/0.13.1-log-plugins-on-error-page branch that you can safely apply to 0.13.1 (it should just be a commit on top of it). It has a successful CI build: https://travis-ci.org/Kong/kong/builds/378979503.

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented May 15, 2018

@thibaultcha I def see one difference off top of my head I missed between master vs 0.13.1 changes:

  1. rockspec file change, missing the builtin modules line in path
/usr/local/lib/luarocks/rocks/kong/0.13.1-0/kong-0.13.1-0.rockspec
 ["kong.resty.ctx"] = "kong/resty/ctx.lua",

If missing lines in the rockspec like that will result in a lack of plugins running at all then that probably is it, there is a rock_manifest in the directory alongside the rockspec but reading into it I assume that gets generated at build time by kong based on what I saw within it. Already applied the runloop back to core change from your original master wip to my init.lua and the change in custom kong template is minimal and I still believe it to be correct, but will certainly diff on all the files once again later today between this new 0.13.1 wip branch and our current build out. Will get back to this thread sometime later today with results. Thanks again for your stellar support!

Edit - welp thought:

COPY patchtests/kong/kong-0.13.1-0.rockspec /usr/local/lib/luarocks/rocks/kong/0.13.1-0/kong-0.13.1-0.rockspec
RUN chmod 777 /usr/local/lib/luarocks/rocks/kong/0.13.1-0/kong-0.13.1-0.rockspec

Would do the trick. Strangely enough the rock_manifest produced still had the same hash as an unmodified 0.13.1 baseline branch.

["kong-0.13.1-0.rockspec"] = "hashHere",

And the resty/ctx.lua was no where to be found in the manifest and I assume if the rockspec did its thing correctly the produced manifest would have included something additional like:

resty = {
   ["ctx.lua"] = "hashHere"
}

Op.. jeesh 6:45am, time for nap and will resume tackling this later.

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented May 16, 2018

Did try getting this to work some more today, beats me the difference in src vs 0.13.1 vs tarball docker method w custom template(which does work on baseline tarball 0.13.x that comes down from bintray). Might be best just to wait until this releases into v14 (don't wanna waste any more of your time trying to support me testing it) and then I will post the results here in 3-4 weeks or whatever it takes for next major release candidate. Will do a performance benchmark of an environment with identical specs one on Kong 14 with the additional ctx saving functionality for internal redirect handling and one test on 0.13.1 without it. Will see if there is an increase in latency / mem consumption. So will post here in weeks to follow, really looking forward to seeing how far Kong can harden its logging functionalities now with this newer implementation falling into place.

@bungle
Copy link
Member

bungle commented May 30, 2018

@jeremyjpj0916

Are you using this template:
https://discuss.konghq.com/t/kong-http-log-plugin-errors/445/7

You do have that log_by_lua_block in wrong place there, it should be inside location = /kong_error_handler. On your .zip it is in right place though.

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented May 30, 2018

@bungle yeah the zip is our latest form of it. When I attempted to use these changes(and may have had some bad/old code in the core files too at times from not changing enough of the core files to match thibaultcha changes) it seemed to cause our db to stop recognizing all plugins tied to api definitions and even after reverting to older code it stayed broken where the plugins were not getting recognized even though they were in the db bringing kong up/down. I have resolved to once .14 is officially out with a tarball to pull I will come back here and to a test comparison between 13 and 14 for performance against an arbitrary API. I figure these changes will not increase latency really just use more ram due to duplicating the ctx essentially. + IMO the most important things a gateway can have is proper logging so people on both ends client / back-end know what the heck happened(especially in times of L4 errors and TLS related woes).

@thibaultcha
Copy link
Member

@jeremyjpj0916 For now we are going to experiment a few other possibilities, and @bungle volunteered to take a look at some of them in the next few days. I am sure he will keep this issue updated with his latest findings 👍

@jeremyjpj0916
Copy link
Contributor Author

Very interested to see what is produced, I thought your first crack at it was great as its what other gateways seem to do with nginx, but maybe Kong can ascend above 💯. May the best implementation win, best of luck bungle! #kongStrong

@bungle
Copy link
Member

bungle commented Jun 4, 2018

Hi @jeremyjpj0916,

There is @thibaultcha's solution:
8795505

There is my solution:
dcde814 (it includes some fixes to plugins etc. that we might want to get regardless of solution we end up).

I learned a lot in this process. In general both solutions do seem to work. My solution is a little bit bigger change fundamentally, though it might be good in a long run. I am not sure which one of these I prefer, even if I am biased to say anything here :-).

@thibaultcha's solution is (simplified):

  1. copy ngx.ctx to ngx.var on rewrite phase
  2. load ngx.ctx from ngx.varon error_page's content phase
  3. add log (and header_filter and body_filter to error_page location block)

Plus Points:

  • fundamentally smaller change, and less unknowns
  • most of kong is still isolated in location block so you may add some other locations to that same server without worrying that kong executes there too (though some things are still there like ssl_certificate_by_lua_block and error_page directives)

Minus Points:

  • Needs FFI to change Nginx internals, thought quite isolated
  • Does make internal request (like before) in which some information may be lost

My solution is (simplified)

  1. move phase handlers from location block to server block
  2. remove error_page directives
  3. handle errors with code

Plus Points:

  • No internal sub-requests (all the Kong phases are executed within same main request)
  • Makes config a little bit cleaner
  • Allows us perhaps a little bit more flexibility (error_page directive is quite restricted)

Minus Points:

  • Bigger change fundamentally (test are passing, but there can be unknowns)
  • Owns proxy server block more tightly

Third option is to use error_page still to some early Nginx caught 4xx errors such as too large headers.

@jeremyjpj0916
Copy link
Contributor Author

@bungle When I was pondering this problem originally I was wondering the reason why Kong chose to use the error_page directive in the first place(less "handle with code" I guess?). I have to say I prefer your follow up impl simply because it requires no copying of the ctx, and in lua land we only have so much memory to work with per worker thread. I also tend to not be super reserved when making changes, even if its a large one if it drives a better implementation going forward I am all for them and I feel like ridding yourself of the internal redirect puts Kong more in control of its Impl (but I am just a newbie scrub to most of this hah so its just my initial impression), I'm gonna read through the changes some more later tonight probably,

@bungle
Copy link
Member

bungle commented Jun 4, 2018

@jeremyjpj0916

There is one ugly thing:
https://github.com/Kong/kong/blob/refactor/proxy-error-handling/kong/errors.lua#L46-L63

It is really a flaky hack. The thing is that with error_page you can catch e.g. 494 but with Lua you cannot :-). Why is that? When that error occurs Nginx immediately errors that. Even before rewrite. So there is no code in Kong that gets run before that error. And with my solution it directly jumps to header_filter. But in that phase Nginx has already changed status code to 400. A more standard one, but more generic too (actually Nginx should be using for that 431 if Nginx wants to follow standards). I might send them a patch at some point. So at header filter all we have is 400 and Content-Length. I too cleverly use that Content-Length to detect the reason for 400. Perhaps it was originally 494? So this is what I think we need to get rid of. One way to do this is to set Content-Length to nil at header_filter, but then it would mean that we need to body_filter the whole body to detect the real error and then some string matching. It is ugly as hell too, and slower.

The remaining idea is to add error_page only for those Nginx so called special errors:

  • 494
  • 495
  • 496
  • 497

All non-standard error codes. I think we could even change them in some cases like 494 to 431. So for these early on client errors we will run error handler. They occur at point where Kong hasn't even ran any of the code, so that means there is no context to be copied. But the remaining question would then be, should we still run e.g. global logging plugins?

@bungle
Copy link
Member

bungle commented Jun 4, 2018

And one option is to is to still have rewrite and access inside location block (to make kong to own that server block a bit less), and move only header_filter, body_filter and log to server block. We can also come up with defaults and maybe guidanve to other cases. I also kinda like everything nicely lined in server block.

@thibaultcha
Copy link
Member

because it requires no copying of the ctx, and in lua land we only have so much memory to work with per worker thread

Just to be clear here: there is no "copying" per say, only passing a reference from one request to another.

Thanks for the write-up @bungle!

@jeremyjpj0916
Copy link
Contributor Author

Glad for the clarification, passing as a reference makes the first impl not seem as bad on mem as I was assuming then.

@bungle
Copy link
Member

bungle commented Jun 4, 2018

Yes, when I wrote that, and got away of computer. I also started wondering, whether I used word copying wrongly here (sorry for misconception). Because I had looked at the code, and I knew it was just playing with references. In general I won't expect performance wise much difference. Though, I don't have numbers. Effects in hot code paths are quite minimal on both. It is a hard to pick winner. Like tossing a coin.

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented Jun 4, 2018

pass by value vs pass by reference | copy by value vs copy by reference are different phrasings I have always followed in Java. So pass/copy by reference is okay so I don't think you you were wrong. I just was thinking it was value and not reference :) so the clarification helped me out.

@jeremyjpj0916
Copy link
Contributor Author

jeremyjpj0916 commented Jun 8, 2018

Reviewed the code best I could, following what I know now based on thibaultcha comments + bungle insights I agree its a toss up, both have their merits and are primarily grounded on "minimal change" vs "do more with code" and they both have minus points that seem to have similar impact downsides. Interested to see which one Kong goes with. I am still EXTREMELY excited for this functionality regardless and I am sure Kong is too, will really help community/enterprise teams have better logs to diagnose the real uptime/health of their apis. My list of major Kong improvement needs will thin by 1 and is in the single digit count anyways :) . Onward and upward!

EDIT - On the third option idk how I feel about that, incorporating both would add confusion to a newbie developer(like meee 😆) as you now have error handling taking two different approaches. Functionally it may yield the best coverage/combination though? Regardless the end game should be to log as many error types as Kong possibly can to its various log sinks and maybe this method takes it there. The gateway ultimately serves as the primary proof of transaction in the API space and we need these transactions documented.

thibaultcha added a commit that referenced this issue Jun 12, 2018
Context
-------

Before this patch, several scenarios would make Kong/Nginx produce
errors that did not execute plugins:

1. Nginx-produced client errors (HTTP 4xx) on invalid requests per the
   Nginx settings (e.g. headers too large, URI too large, etc...)
2. Proxy TCP/HTTP errors on the last balancer try (e.g. connection
   refused, connection timeout, read/write timeout, etc...)
3. Any other Nginx-produced server errors (HTTP 5xx), if any

Because plugins are not executed in such cases, logging plugins in
particular do not get a chance to report such errors, leaving Kong
administrator blind when relying on bundled or homegrown logging
plugins. Additionally, other plugins, e.g. transformation, do not get a
chance to run either and could lead to scenarios where those errors
produce headers or response bodies that are undesired by the
administrator.

This issue is similar to that fixed by #3079, which handled the
following case:

4. Requests short-circuited by Kong itself (vs Nginx in 1, 2, and 3.)

See also #490 & #892 for issues related to 4. This patches addresses 1,
2, and 3.

Problem
-------

Nginx-produced errors (4xx and 5xx) are redirected to the
`/kong_error_handler` location by way of the `error_page` directive,
but the runloop is not given a chance to run, since the error handler
is only instructed to override the Nginx-defined HTML responses, and
produces Kong-friendly ones.

In short, a patch solving this issue must enable the runloop within the
error handler.

Proposed solution
-----------------

Several (many) possibilities were evaluated and explored to fix this.
The proposed patch is the result of many attempts, observations, and
adjustments to initial proposals. In a gist, it proposes:

* Executing the response part of the runloop in the error_page location.
* Ensure that if the request part of the runloop did not have a chance
  to run, we do build the list of plugins to execute (thus, global plugins
  only since the request was not processed).
* Ensure that if the request part of the runloop did run, we preserve
  its `ngx.ctx` values.

A few catches:

* Currently, `ngx.req.get_method()` in the `error_page` location block
  will return `GET` even if the request method was different (only
  `HEAD` is preserved). A follow up patch for Nginx is being prepared.
  As such, several tests are currently marked as "Pending".
* When the `rewrite` phase did not get a chance to run, some parts of
  the request may not have been parsed by Nginx (e.g. on HTTP 414), thus
  some variables, such as `$request_uri` or request headers may not have
  been parsed. Code should be written defensively against such
  possibilities, and logging plugins may report incomplete request data
  (but will still report the produced error).
* The `uninitialized_variable_warn` directive must be disabled in the
  `error_page` location block for cases when the `rewrite` phase did not
  get a chance to run (Nginx-produced 4xx clients errors).
* The `ngx.ctx` variables must be copied by reference for plugins to be
  able to access it in the response phases when the request was
  short-circuited by Nginx (e.g. HTTP 502/504 from the upstream module).
* Using a named location for the `error_page` handler is not possible,
  since in some cases the URI might not be parsed by Nginx, and thus Nginx
  refuses to call named locations in such cases (e.g. on HTTP 414).

Resulting behavior
------------------

The `03-plugins_triggering_spec.lua` test suite is now a good reference
for the behavior observed upon Nginx-produced errors. This behavior can
be classified in 2 categories:

1. When the request is not parsed by Nginx (HTTP 4xx)
    a. The `error_page` directive is called (`rewrite` and `access` are
       skipped)
    b. The runloop retrieves the list of **global** plugins to run (it
       cannot retrieve more specific plugins since the request could not be
       parsed)
    c. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    d. It executes the `header_filter` phase of each global plugin.
    e. It executes the `body_filter` phase of each global plugin.
    f. It executes the `log` phase of each global plugin.

2. When the error was produced during upstream module execution (HTTP 5xx)
    a. The `rewrite` and `access` phases ran normally (we stashed the
       `ngx.ctx` reference)
    b. The `error_page` directive is called
    c. The error content handler applies the original `ngx.ctx` by
       reference
    d. It does **not** run the plugins loop, since `ngx.ctx` contains
       the list of plugins to run (global **and** specific plugins in
       this case)
    e. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    f. It executes the `header_filter` phase of each global plugin.
    g. It executes the `body_filter` phase of each global plugin.
    h. It executes the `log` phase of each global plugin.

Fix #3193
@thibaultcha
Copy link
Member

@jeremyjpj0916 Just opened #3533

@jeremyjpj0916
Copy link
Contributor Author

@bungle @thibaultcha Exceptional engineering with great focus on detail and analysis. The resulting patch is really above and beyond my original expectation of just solving 5xx level L4 logging. Glad this is making it into the upcoming release. Puts Kong logging on a whole new level. Thanks!

thibaultcha added a commit that referenced this issue Jun 14, 2018
Context
-------

Before this patch, several scenarios would make Kong/Nginx produce
errors that did not execute plugins:

1. Nginx-produced client errors (HTTP 4xx) on invalid requests per the
   Nginx settings (e.g. headers too large, URI too large, etc...)
2. Proxy TCP/HTTP errors on the last balancer try (e.g. connection
   refused, connection timeout, read/write timeout, etc...)
3. Any other Nginx-produced server errors (HTTP 5xx), if any

Because plugins are not executed in such cases, logging plugins in
particular do not get a chance to report such errors, leaving Kong
administrator blind when relying on bundled or homegrown logging
plugins. Additionally, other plugins, e.g. transformation, do not get a
chance to run either and could lead to scenarios where those errors
produce headers or response bodies that are undesired by the
administrator.

This issue is similar to that fixed by #3079, which handled the
following case:

4. Requests short-circuited by Kong itself (vs Nginx in 1, 2, and 3.)

See also #490 & #892 for issues related to 4. This patches addresses 1,
2, and 3.

Problem
-------

Nginx-produced errors (4xx and 5xx) are redirected to the
`/kong_error_handler` location by way of the `error_page` directive,
but the runloop is not given a chance to run, since the error handler
is only instructed to override the Nginx-defined HTML responses, and
produces Kong-friendly ones.

In short, a patch solving this issue must enable the runloop within the
error handler.

Proposed solution
-----------------

Several (many) possibilities were evaluated and explored to fix this.
The proposed patch is the result of many attempts, observations, and
adjustments to initial proposals. In a gist, it proposes:

* Executing the response part of the runloop in the error_page location.
* Ensure that if the request part of the runloop did not have a chance
  to run, we do build the list of plugins to execute (thus, global plugins
  only since the request was not processed).
* Ensure that if the request part of the runloop did run, we preserve
  its `ngx.ctx` values.

A few catches:

* Currently, `ngx.req.get_method()` in the `error_page` location block
  will return `GET` even if the request method was different (only
  `HEAD` is preserved). A follow up patch for Nginx is being prepared.
  As such, several tests are currently marked as "Pending".
* When the `rewrite` phase did not get a chance to run, some parts of
  the request may not have been parsed by Nginx (e.g. on HTTP 414), thus
  some variables, such as `$request_uri` or request headers may not have
  been parsed. Code should be written defensively against such
  possibilities, and logging plugins may report incomplete request data
  (but will still report the produced error).
* The `uninitialized_variable_warn` directive must be disabled in the
  `error_page` location block for cases when the `rewrite` phase did not
  get a chance to run (Nginx-produced 4xx clients errors).
* The `ngx.ctx` variables must be copied by reference for plugins to be
  able to access it in the response phases when the request was
  short-circuited by Nginx (e.g. HTTP 502/504 from the upstream module).
* Using a named location for the `error_page` handler is not possible,
  since in some cases the URI might not be parsed by Nginx, and thus Nginx
  refuses to call named locations in such cases (e.g. on HTTP 414).

Resulting behavior
------------------

The `03-plugins_triggering_spec.lua` test suite is now a good reference
for the behavior observed upon Nginx-produced errors. This behavior can
be classified in 2 categories:

1. When the request is not parsed by Nginx (HTTP 4xx)
    a. The `error_page` directive is called (`rewrite` and `access` are
       skipped)
    b. The runloop retrieves the list of **global** plugins to run (it
       cannot retrieve more specific plugins since the request could not be
       parsed)
    c. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    d. It executes the `header_filter` phase of each global plugin.
    e. It executes the `body_filter` phase of each global plugin.
    f. It executes the `log` phase of each global plugin.

2. When the error was produced during upstream module execution (HTTP 5xx)
    a. The `rewrite` and `access` phases ran normally (we stashed the
       `ngx.ctx` reference)
    b. The `error_page` directive is called
    c. The error content handler applies the original `ngx.ctx` by
       reference
    d. It does **not** run the plugins loop, since `ngx.ctx` contains
       the list of plugins to run (global **and** specific plugins in
       this case)
    e. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    f. It executes the `header_filter` phase of each global plugin.
    g. It executes the `body_filter` phase of each global plugin.
    h. It executes the `log` phase of each global plugin.

Fix #3193
thibaultcha added a commit that referenced this issue Jun 15, 2018
Context
-------

Before this patch, several scenarios would make Kong/Nginx produce
errors that did not execute plugins:

1. Nginx-produced client errors (HTTP 4xx) on invalid requests per the
   Nginx settings (e.g. headers too large, URI too large, etc...)
2. Proxy TCP/HTTP errors on the last balancer try (e.g. connection
   refused, connection timeout, read/write timeout, etc...)
3. Any other Nginx-produced server errors (HTTP 5xx), if any

Because plugins are not executed in such cases, logging plugins in
particular do not get a chance to report such errors, leaving Kong
administrator blind when relying on bundled or homegrown logging
plugins. Additionally, other plugins, e.g. transformation, do not get a
chance to run either and could lead to scenarios where those errors
produce headers or response bodies that are undesired by the
administrator.

This issue is similar to that fixed by #3079, which handled the
following case:

4. Requests short-circuited by Kong itself (vs Nginx in 1, 2, and 3.)

See also #490 & #892 for issues related to 4. This patches addresses 1,
2, and 3.

Problem
-------

Nginx-produced errors (4xx and 5xx) are redirected to the
`/kong_error_handler` location by way of the `error_page` directive,
but the runloop is not given a chance to run, since the error handler
is only instructed to override the Nginx-defined HTML responses, and
produces Kong-friendly ones.

In short, a patch solving this issue must enable the runloop within the
error handler.

Proposed solution
-----------------

Several (many) possibilities were evaluated and explored to fix this.
The proposed patch is the result of many attempts, observations, and
adjustments to initial proposals. In a gist, it proposes:

* Executing the response part of the runloop in the error_page location.
* Ensure that if the request part of the runloop did not have a chance
  to run, we do build the list of plugins to execute (thus, global plugins
  only since the request was not processed).
* Ensure that if the request part of the runloop did run, we preserve
  its `ngx.ctx` values.

A few catches:

* Currently, `ngx.req.get_method()` in the `error_page` location block
  will return `GET` even if the request method was different (only
  `HEAD` is preserved). A follow up patch for Nginx is being prepared.
  As such, several tests are currently marked as "Pending".
* When the `rewrite` phase did not get a chance to run, some parts of
  the request may not have been parsed by Nginx (e.g. on HTTP 414), thus
  some variables, such as `$request_uri` or request headers may not have
  been parsed. Code should be written defensively against such
  possibilities, and logging plugins may report incomplete request data
  (but will still report the produced error).
* The `uninitialized_variable_warn` directive must be disabled in the
  `error_page` location block for cases when the `rewrite` phase did not
  get a chance to run (Nginx-produced 4xx clients errors).
* The `ngx.ctx` variables must be copied by reference for plugins to be
  able to access it in the response phases when the request was
  short-circuited by Nginx (e.g. HTTP 502/504 from the upstream module).
* Using a named location for the `error_page` handler is not possible,
  since in some cases the URI might not be parsed by Nginx, and thus Nginx
  refuses to call named locations in such cases (e.g. on HTTP 414).

Resulting behavior
------------------

The `03-plugins_triggering_spec.lua` test suite is now a good reference
for the behavior observed upon Nginx-produced errors. This behavior can
be classified in 2 categories:

1. When the request is not parsed by Nginx (HTTP 4xx)
    a. The `error_page` directive is called (`rewrite` and `access` are
       skipped)
    b. The runloop retrieves the list of **global** plugins to run (it
       cannot retrieve more specific plugins since the request could not be
       parsed)
    c. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    d. It executes the `header_filter` phase of each global plugin.
    e. It executes the `body_filter` phase of each global plugin.
    f. It executes the `log` phase of each global plugin.

2. When the error was produced during upstream module execution (HTTP 5xx)
    a. The `rewrite` and `access` phases ran normally (we stashed the
       `ngx.ctx` reference)
    b. The `error_page` directive is called
    c. The error content handler applies the original `ngx.ctx` by
       reference
    d. It does **not** run the plugins loop, since `ngx.ctx` contains
       the list of plugins to run (global **and** specific plugins in
       this case)
    e. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    f. It executes the `header_filter` phase of each global plugin.
    g. It executes the `body_filter` phase of each global plugin.
    h. It executes the `log` phase of each global plugin.

Fix #3193
thibaultcha added a commit that referenced this issue Jun 15, 2018
Context
-------

Before this patch, several scenarios would make Kong/Nginx produce
errors that did not execute plugins:

1. Nginx-produced client errors (HTTP 4xx) on invalid requests per the
   Nginx settings (e.g. headers too large, URI too large, etc...)
2. Proxy TCP/HTTP errors on the last balancer try (e.g. connection
   refused, connection timeout, read/write timeout, etc...)
3. Any other Nginx-produced server errors (HTTP 5xx), if any

Because plugins are not executed in such cases, logging plugins in
particular do not get a chance to report such errors, leaving Kong
administrator blind when relying on bundled or homegrown logging
plugins. Additionally, other plugins, e.g. transformation, do not get a
chance to run either and could lead to scenarios where those errors
produce headers or response bodies that are undesired by the
administrator.

This issue is similar to that fixed by #3079, which handled the
following case:

4. Requests short-circuited by Kong itself (vs Nginx in 1, 2, and 3.)

See also #490 & #892 for issues related to 4. This patches addresses 1,
2, and 3.

Problem
-------

Nginx-produced errors (4xx and 5xx) are redirected to the
`/kong_error_handler` location by way of the `error_page` directive,
but the runloop is not given a chance to run, since the error handler
is only instructed to override the Nginx-defined HTML responses, and
produces Kong-friendly ones.

In short, a patch solving this issue must enable the runloop within the
error handler.

Proposed solution
-----------------

Several (many) possibilities were evaluated and explored to fix this.
The proposed patch is the result of many attempts, observations, and
adjustments to initial proposals. In a gist, it proposes:

* Executing the response part of the runloop in the error_page location.
* Ensure that if the request part of the runloop did not have a chance
  to run, we do build the list of plugins to execute (thus, global plugins
  only since the request was not processed).
* Ensure that if the request part of the runloop did run, we preserve
  its `ngx.ctx` values.

A few catches:

* Currently, `ngx.req.get_method()` in the `error_page` location block
  will return `GET` even if the request method was different (only
  `HEAD` is preserved). A follow up patch for Nginx is being prepared.
  As such, several tests are currently marked as "Pending".
* When the `rewrite` phase did not get a chance to run, some parts of
  the request may not have been parsed by Nginx (e.g. on HTTP 414), thus
  some variables, such as `$request_uri` or request headers may not have
  been parsed. Code should be written defensively against such
  possibilities, and logging plugins may report incomplete request data
  (but will still report the produced error).
* The `uninitialized_variable_warn` directive must be disabled in the
  `error_page` location block for cases when the `rewrite` phase did not
  get a chance to run (Nginx-produced 4xx clients errors).
* The `ngx.ctx` variables must be copied by reference for plugins to be
  able to access it in the response phases when the request was
  short-circuited by Nginx (e.g. HTTP 502/504 from the upstream module).
* Using a named location for the `error_page` handler is not possible,
  since in some cases the URI might not be parsed by Nginx, and thus Nginx
  refuses to call named locations in such cases (e.g. on HTTP 414).

Resulting behavior
------------------

The `03-plugins_triggering_spec.lua` test suite is now a good reference
for the behavior observed upon Nginx-produced errors. This behavior can
be classified in 2 categories:

1. When the request is not parsed by Nginx (HTTP 4xx)
    a. The `error_page` directive is called (`rewrite` and `access` are
       skipped)
    b. The runloop retrieves the list of **global** plugins to run (it
       cannot retrieve more specific plugins since the request could not be
       parsed)
    c. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    d. It executes the `header_filter` phase of each global plugin.
    e. It executes the `body_filter` phase of each global plugin.
    f. It executes the `log` phase of each global plugin.

2. When the error was produced during upstream module execution (HTTP 5xx)
    a. The `rewrite` and `access` phases ran normally (we stashed the
       `ngx.ctx` reference)
    b. The `error_page` directive is called
    c. The error content handler applies the original `ngx.ctx` by
       reference
    d. It does **not** run the plugins loop, since `ngx.ctx` contains
       the list of plugins to run (global **and** specific plugins in
       this case)
    e. The error content handler overrides the Nginx body with a Kong-friendly
       one (unchanged from previous behavior)
    f. It executes the `header_filter` phase of each global plugin.
    g. It executes the `body_filter` phase of each global plugin.
    h. It executes the `log` phase of each global plugin.

Fix #3193
From #3533
@thibaultcha
Copy link
Member

@jeremyjpj0916 Thank you for the kind words. Now closing this since #3533 has been merged - expect a 0.14.0rc1 in a few days. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task/feature Requests for new features in Kong
Projects
None yet
Development

No branches or pull requests

4 participants