Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rust-doh closes connection with no return code on malformed request and upstream timeout #20

Closed
appliedprivacy opened this issue May 25, 2019 · 4 comments

Comments

@appliedprivacy
Copy link

Hi,

our setup:
nginx -> rust-doh -> unbound

In a small fraction (~1 out of 1000 queries) nginx returns the HTTP 502 response code which suggests that rust-doh had an issue with the request and returned something unexpected.

We never see HTTP 502 response codes when using
https://github.com/facebookexperimental/doh-proxy

Since apparently there are no rust-doh logs (#11) it is hard to further debug this.

  • You also run a public DoH server using rust-doh, how often do you see that error code (if you also have nginx)?
  • How can we debug this further?

version: 0.1.15

@jedisct1
Copy link
Member

The proxy never returns a 502 response code. But it enforces timeouts, so that clients don't wait forever. There is also a maximum number of simultaneous connections (512, by default).

When one of these limits is reached, the proxy will close the connection, probably leading to a 502 response code generated by Nginx.

Try increasing these limits to see if it helps. But then you also need to make sure that you are not going to hit system limits, such as the file descriptor limits.

I also run the DoH servers behind Nginx, but I don't log.

@appliedprivacy
Copy link
Author

appliedprivacy commented May 26, 2019

I've a slightly better understanding now but further digging is needed and help is welcome. :)

nginx config

Let's start with some excerpts from the nginx config to better understand the logs:

log_format minimaldns_upstream '$time_iso8601 $status $upstream_status $upstream_addr $geoip_country_code';
upstream rust_doh_backend {
	server 127.0.0.1:3000;
	server 127.0.0.1:3001;
}

two types of HTTP 502 log events

We observed two type of log entries related to HTTP 502:

a)

2019-05-26T16:04:06+02:00 502 502 127.0.0.1:3000 -

same for the instance on port 3001.

They occur rarely but consistently.

b)

2019-05-26T15:57:14+02:00 502 502 rust_doh_backend -

They occurred even less frequently than (a) but when they happen they came at a high number at once.

https://nginx.org/en/docs/http/ngx_http_upstream_module.html#var_upstream_addr explains the difference of the upstream_addr value:

If a server cannot be selected, the variable keeps the name of the server group.

conclusion about type (b) log events

That means during type (b) events, nginx considered all two backends unavailable.
"failed" is defined in: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream

So by increasing max_fails to a value higher than 1 we can prevent type (b) events to tell nginx "try harder, don't give up so easily".

Now we need to find out more about type (a) log events. According to nginx, doh-proxy actually returned HTTP 502.
Since there are two doh-proxy instances running we are not anywhere near the default limit of concurrent connections since nginx gets less than that limit (512).

HTTP 502 consistent across instances

The following logs suggest that the error is consistent across the two instances of doh-proxy:

2019-05-26T16:44:04+02:00 502 502, 502 127.0.0.1:3000, 127.0.0.1:3001 -
2019-05-26T16:44:06+02:00 502 502, 502 127.0.0.1:3000, 127.0.0.1:3001 -
2019-05-26T16:44:38+02:00 502 502, 502 127.0.0.1:3000, 127.0.0.1:3001 -
2019-05-26T16:44:46+02:00 502 502, 502 127.0.0.1:3001, 127.0.0.1:3000 -
2019-05-26T16:44:53+02:00 502 502, 502 127.0.0.1:3001, 127.0.0.1:3000 -

@appliedprivacy
Copy link
Author

appliedprivacy commented May 31, 2019

I assume I understand the issue better now.

This is expected:

curl  http://127.0.0.1:3000/query -v
* Expire in 0 ms for 6 (transfer 0x55664e118d00)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55664e118d00)
* Connected to 127.0.0.1 (127.0.0.1) port 3000 (#0)
> GET /query HTTP/1.1
> Host: 127.0.0.1:3000
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 400 Bad Request
< content-length: 0
< date: Fri, 31 May 2019 22:03:45 GMT
< 
* Connection #0 to host 127.0.0.1 left intact

This is unexpected (no response at all):

curl -d "" http://127.0.0.1:3000/query -v
* Expire in 0 ms for 6 (transfer 0x562d8ddb4d00)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x562d8ddb4d00)
* Connected to 127.0.0.1 (127.0.0.1) port 3000 (#0)
> POST /query HTTP/1.1
> Host: 127.0.0.1:3000
> User-Agent: curl/7.64.0
> Accept: */*
> Content-Length: 0
> Content-Type: application/x-www-form-urlencoded
> 
* Empty reply from server
* Connection #0 to host 127.0.0.1 left intact
curl: (52) Empty reply from server

What do you think about returning 400 Bad Request in such cases of malformed requests instead of closing the connection directly?

For the upstream timeout case that currently also results in a closed connection:

What do you think about returning 504 Gateway Timeout for cases where it reaches the 10 seconds timeout?
https://tools.ietf.org/html/rfc7231#section-6.6.5

This would be nice so we could tell these cases apart in the error stats.

@appliedprivacy appliedprivacy changed the title rust-doh causing HTTP 502 (Bad Gateway) rust-doh closes connection with no return code on malformed request and upstream timeout May 31, 2019
@jedisct1
Copy link
Member

de99e6a returns a 400 error code for POST queries (this was already the case for GET queries).

Returning something on a timeout seems to be too complicated for my knowledge of Rust and Tokio. The timeout is on the socket itself, and once it fires, the HTTP layer is not accessible any more.

Something I may revisit once async I/O in Rust stabilizes and becomes a bit more friendly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants