
Discovery items type error failure #253

Open
jsetton opened this issue Aug 19, 2019 · 11 comments · Fixed by #256

@jsetton
Collaborator

jsetton commented Aug 19, 2019

Per the skill logs, several users' discovery requests are failing with the error below:

error: discover failed with error:items.forEach is not a function
{
    "stack": "TypeError: items.forEach is not a function\n    at Promise.all.then (/var/task/alexa/v3/discovery.js:45:13)\n    at <anonymous>\n    at process._tickDomainCallback (internal/process/next_tick.js:228:7)"
}

After enabling debug mode, the issue appears to be that the items data value ends up being a string rather than an array: the request call only read/received a partial response, which prevented the expected JSON object from being parsed properly.

The payload in the reference below isn't particularly large, even though it is only partial data, and the Lambda execution time for that request was about 5 seconds.

Lambda request log reference
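
For context, a defensive check along these lines would at least surface the failure more clearly (a sketch only; discoverItems is an illustrative name, not the actual discovery.js code):

// With json: true, the request module falls back to returning the raw body as a
// string when it cannot be parsed as JSON, so a truncated response reaches the
// discovery code as a string and items.forEach throws.
function discoverItems(items) {
  if (!Array.isArray(items)) {
    throw new TypeError(`Expected an items array, got ${typeof items}`);
  }
  items.forEach((item) => {
    // ... map each openHAB item to an Alexa discovery endpoint
  });
}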

@jsetton jsetton added the bug label Aug 19, 2019
@jsetton
Collaborator Author

jsetton commented Aug 19, 2019

@digitaldan I moved this one into its own issue. I am still seeing this error being generated a couple of times every hour.

Anyway, I did some research and found a possible solution: including Connection: keep-alive in the request headers.

I did some testing and the request module seems to default to Connection: close when not specified, which would explain why the network connection could get closed prematurely.

Do you see any reason for not making this change?

Direct Connection

REQUEST onRequestResponse http://localhost:8080/rest/items?fields=editable%2CgroupNames%2CgroupType%2Cname%2Clabel%2Cmetadata%2Cstate%2CstateDescription%2Ctags%2Ctype&metadata=alexa%2Cchannel%2Csynonyms 200 { 
  connection: 'close',
  'content-type': 'application/json',
  server: 'Jetty(9.4.18.v20190429)'
}

Behind Reverse Proxy Server (nginx)

REQUEST onRequestResponse https://myopenhab.org/rest/items?fields=editable%2CgroupNames%2CgroupType%2Cname%2Clabel%2Cmetadata%2Cstate%2CstateDescription%2Ctags%2Ctype&metadata=alexa%2Cchannel%2Csynonyms 200 {
  server: 'nginx',
  date: 'Mon, 19 Aug 2019 16:41:07 GMT',
  'content-type': 'application/json',
  'transfer-encoding': 'chunked',
  connection: 'close'
}

@digitaldan
Collaborator

If that works I don't see a problem with it. I'm not sure why it works :-) AFAIK keep-alive is used so the underlying TCP connection can be reused by multiple requests, saving the cost of creating and tearing down a TCP connection for each one; browsers would be painfully slow without it.

If the request is still in flight and there is a pause in the data stream, I guess it's possible something like an nginx proxy could terminate it if keep-alive is not set, so maybe that's what it is fixing?

@jsetton
Collaborator Author

jsetton commented Aug 20, 2019

If the request is still in flight and there is a pause in the data stream, I guess it's possible something like an nginx proxy could terminate it if keep-alive is not set, so maybe that's what it is fixing?

My understanding is that when sending keep-alive to the server, it will not close the connection after responding to the initial request unless the client requests it. While the connection remains open, it can be reused, and it will time out after a set idle period. On top of this, I believe keep-alive is enabled by default in HTTP/1.1.

I think we should enable it, deploy, and see if we still get that error. Do you see any reason why we shouldn't?

@digitaldan
Collaborator

I am totally good with it.

@jsetton
Collaborator Author

jsetton commented Aug 21, 2019

So the connection keep-alive change I made didn't resolve this issue.

After doing more research, I believe the issue is related to the fact that the myopenhab cloud connector nginx server is buffering every request/response before passing the data on.

That buffer has a limited size and can fill up when there are a lot of simultaneous discovery requests in a short amount of time. My understanding is that when the buffer gets full, some of the data chunks may be dropped in flight, causing a partial response to be sent.

Going over the nginx configuration committed in the cloud service GitHub repository, I can see that the proxy_buffering setting is left at its default of enabled, since it is omitted. Is there a reason why we want to have it enabled? It seems that disabling it could potentially resolve our issue.

proxy_buffering off;

Also, we could enable gzip for the application/json content type to decrease the size of the response sent back to the Lambda function, which I believe would also decrease the response time.

gzip on;
gzip_comp_level 6;
gzip_types application/json;
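
For the gzip change to help, the Lambda side also has to advertise support; with the request package that could look roughly like this (a sketch, with baseUrl as a placeholder):

const request = require('request');
const baseUrl = 'https://myopenhab.org'; // placeholder

request({
  url: `${baseUrl}/rest/items`,
  gzip: true, // sends Accept-Encoding: gzip and transparently decompresses the body
  json: true
}, (error, response, items) => {
  // items is the decompressed, parsed payload
});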

@digitaldan What do you think?

@digitaldan
Collaborator

I'll take a look at our nginx config; I'm not 100% sure whether we set that for a reason or it was copied from some default nginx proxy config example.

@jsetton
Collaborator Author

jsetton commented Aug 29, 2019

Now that the request timeout logic is deployed, the logs include the bearer token for each request that timed out.

Based on my testing of a few of them, the request times out after 60 seconds, when the myopenhab cloud connector nginx server times out with a "504 Gateway Time-out" error. Anyway, I was under the impression the cloud connector could track remote clients that are offline and directly respond to external requests with an offline message.

The reason I am bringing this up is that each of these failed requests wastes 8 seconds of compute time. There are about 35 timed-out requests per hour across the regions so far. Maybe there is something that can easily be done on the cloud connector side.
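
For reference, a request-level timeout with the request package looks roughly like this (a sketch with illustrative values, not the skill's actual timeout implementation):

const request = require('request');
const baseUrl = 'https://myopenhab.org'; // placeholder

request({
  url: `${baseUrl}/rest/items`,
  timeout: 8000, // illustrative value, in milliseconds
  json: true
}, (error, response, items) => {
  if (error && (error.code === 'ETIMEDOUT' || error.code === 'ESOCKETTIMEDOUT')) {
    // fail fast instead of waiting out the nginx 60-second 504
  }
});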

@jsetton
Collaborator Author

jsetton commented Sep 1, 2019

According to this documentation, if Cache-Control: no-cache is specified in the request headers, nginx will not cache responses.

Additionally, after doing some tests with one of the affected users, I noticed that the proper JSON object was returned on subsequent calls. So we could introduce retry logic at that level when the items data isn't valid.
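
A minimal retry sketch along those lines (getItems and the retry count are illustrative assumptions, not the eventual PR):

// Re-issue the items request when the parsed body isn't the expected array.
function getItemsWithRetry(getItems, retriesLeft = 3) {
  return getItems().then((items) => {
    if (Array.isArray(items)) {
      return items;
    }
    if (retriesLeft > 0) {
      return getItemsWithRetry(getItems, retriesLeft - 1);
    }
    return Promise.reject(new TypeError('items response is not an array'));
  });
}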

@jsetton
Collaborator Author

jsetton commented Sep 3, 2019

So the change I included in #270, adding Cache-Control: no-cache to the request headers, isn't resolving this issue. I will submit a PR introducing retry logic as mentioned in my previous comment. Hopefully, this will finally resolve the matter.

@jsetton
Collaborator Author

jsetton commented Sep 4, 2019

The retry logic is helping a little, but there are still some requests that aren't able to get the complete items data after 4 tries. I don't think we can get around tweaking the nginx configuration. In any case, we need to add gzip support.

One thing I noticed while testing with two separate errors: the data always gets truncated at 48k (49,152 bytes), no matter what the source is. This is a clear indication of some sort of buffering/paging issue on the server side.

@jsetton
Collaborator Author

jsetton commented Sep 22, 2019

myopenhab cloud connector nginx server times out with a "504 Gateway Time-out" error

Based on this post, it seems that the timeout error may be caused by the user's cloud connector binding being set up only for notification access mode, with remote access disabled.

Is this something that can be determined at the cloud connector level, so it can return an error directly instead of waiting until the timeout is reached?
