Event stream cannot always be opened #6338

Closed
2 of 4 tasks
johanstokking opened this issue Jun 23, 2023 · 5 comments
Labels
c/console This is related to the Console needs/discussion We need to discuss this

Comments

@johanstokking
Member

johanstokking commented Jun 23, 2023

Summary

The event stream sometimes fails to open, or is aborted after opening, so the live traffic view does not work.

Steps to Reproduce

Unfortunately, I do not have clear reproduction steps, but it has happened multiple times now.

So far I only encountered this in the end device live traffic view, being unable to see the simulated uplinks. I do get to see the "last activity" timer updated, but this is probably working locally.

Current Result

The previous events are shown, but new events do not come in.

Expected Result

The live traffic view works.

Relevant Logs

When this happens, I see in the browser network panel that the POST request to /api/v3/events fails with NS_BINDING_ABORTED.

URL

No response

Deployment

The Things Stack Community Edition

The Things Stack Version

3.26.1

Client Name and Version

Firefox 114.0.2

Other Information

No response

Proposed Fix

According to https://stackoverflow.com/questions/704561/ns-binding-aborted-shown-in-firefox-with-httpfox, this may be related to caching. I also see other potential reasons.

Contributing

  • I can help by doing more research.
  • I can help by implementing a fix after the proposal above is approved.
  • I can help by testing the fix before it's released.

Code of Conduct

@johanstokking johanstokking added c/console This is related to the Console needs/triage We still need to triage this labels Jun 23, 2023
@NicolasMrad NicolasMrad added the needs/discussion We need to discuss this label Jun 27, 2023
@NicolasMrad NicolasMrad removed the needs/triage We still need to triage this label Jun 27, 2023
@NicolasMrad NicolasMrad added this to the 2023 Q3 milestone Jun 27, 2023
@kschiffer
Contributor

kschiffer commented Jun 27, 2023

> So far I only encountered this in the end device live traffic view, being unable to see the simulated uplinks. I do get to see the "last activity" timer updated, but this is probably working locally.

"Last activity" uses the application-level event stream, whereas viewing an end device's events opens a separate end device event stream.

I remember an earlier issue where the event stream would not open if too many connections were already open or pending. IIRC, this happened because event streams were not closed after leaving the live data view. A regression there is one possible cause. I'll check whether we can rule this out.
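
The cleanup concern above (streams left open after leaving the live data view) can be illustrated with a small sketch. This is not the Console's code: `EventStreamRegistry`, `StreamOpener` and the key format are hypothetical names, and the only platform API assumed is the standard `AbortController`. The idea is that every stream is opened with an abort signal, and leaving a view aborts exactly the streams that view opened.

```typescript
// Hypothetical registry that tracks open event streams so each one can be
// aborted when its view is left. None of these names come from the Console.
type StreamOpener = (signal: AbortSignal) => void;

class EventStreamRegistry {
  private readonly controllers = new Map<string, AbortController>();

  // Opens a stream for `key`, first closing any stream already open for it.
  open(key: string, opener: StreamOpener): void {
    this.close(key);
    const controller = new AbortController();
    this.controllers.set(key, controller);
    opener(controller.signal);
  }

  // Aborts the stream for `key`. Returns false if no stream was open.
  close(key: string): boolean {
    const controller = this.controllers.get(key);
    if (controller === undefined) {
      return false;
    }
    controller.abort();
    this.controllers.delete(key);
    return true;
  }

  // Number of streams that are currently tracked as open.
  get openCount(): number {
    return this.controllers.size;
  }
}
```

In a real view, the opener would pass the signal to the streaming request (for example `fetch(url, { method: "POST", signal })`), and `close` would run in the view's unmount cleanup. A growing `openCount` after navigating between views would point at the kind of regression described above.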

@johanstokking
Member Author

johanstokking commented Jun 28, 2023

That's likely. It also seems that browsers allow only six concurrent connections per host. So if we already consume two (application + device) per device tab, this quickly adds up, not even counting what other tabs keep open.

In any case, I really think we should switch the Console to WebSockets, which would not fall under our API compatibility commitment, and recommend against using the SSE endpoint in browsers. This is not about being right or wrong, about who is to blame, or about fixing this for a particular combination of browser, open tabs and network. It is simply about improving user experience and saving support time on our side. I discussed this offline with @adriansmares; I'm mentioning him here so he can share his thoughts for the record.

I don't think we should escalate this now into dedicated endpoints that the Console would use instead of the gRPC gateway directly. The omitted fields issue is very annoying too, not just for the Console but for everyone, and not only in our gRPC API but also in webhooks and MQTT. We never wanted to touch this because of gogoproto, but we are now cleared to improve our developer experience on this front gradually and incrementally. That issue is not specific to the Console, so let's focus here on a WebSocket event stream.

As WebSockets are bidirectional, we can let the browser send "request" messages to subscribe to and unsubscribe from events, optionally filtering by event name and/or enabling verbose mode. This would allow us to maintain a single WebSocket connection and multiplex entity events over it, as long as the Console correctly unsubscribes when the user navigates away. On the backend, this would mean multiple event subscriptions per WebSocket connection that are created and released dynamically, with all events funneled as JSON over that one connection.
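
To make the subscribe/unsubscribe idea concrete, here is a minimal sketch of the routing logic only, with no actual socket involved. Everything in it is an assumption for illustration: the message shapes (`ClientRequest`), the `SubscriptionMux` class, the string identifiers and the frame format are hypothetical, since the proposal above does not define a wire format.

```typescript
// Hypothetical request messages the browser could send over the socket.
type ClientRequest =
  | { type: "subscribe"; id: number; identifiers: string[]; names?: string[] }
  | { type: "unsubscribe"; id: number };

// Hypothetical, heavily simplified event shape.
interface StackEvent {
  name: string;
  identifier: string;
}

interface Subscription {
  identifiers: Set<string>;
  names: Set<string> | null; // null means "all event names"
}

// Keeps the dynamic set of subscriptions for one connection and decides
// which subscriptions each event is delivered to.
class SubscriptionMux {
  private readonly subscriptions = new Map<number, Subscription>();

  handle(request: ClientRequest): void {
    if (request.type === "subscribe") {
      this.subscriptions.set(request.id, {
        identifiers: new Set(request.identifiers),
        names: request.names ? new Set(request.names) : null,
      });
    } else {
      this.subscriptions.delete(request.id);
    }
  }

  // Returns one JSON frame per subscription that matches the event.
  route(event: StackEvent): string[] {
    const frames: string[] = [];
    this.subscriptions.forEach((subscription, id) => {
      if (!subscription.identifiers.has(event.identifier)) {
        return;
      }
      if (subscription.names !== null && !subscription.names.has(event.name)) {
        return;
      }
      frames.push(JSON.stringify({ subscription: id, event }));
    });
    return frames;
  }
}
```

Tagging each frame with the subscription ID is one possible design choice: it lets the client dispatch events to the right view without inspecting the payload, and makes a missed unsubscribe visible as frames arriving for a view that no longer exists.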


For background, this is really painful for customers and very hard to debug. Example user report from earlier this week:

FYI: on Chrome I received this error after I left the device page open for a while (live data tab)

{
  "time": "2023-06-26T14:15:04.108Z",
  "name": "synthetic.error.unknown",
  "isError": true,
  "isSynthetic": true,
  "unique_id": "synthetic.1687788904108",
  "data": {
    "error": "TypeError: Failed to fetch"
  }
}

On the other hand, on Firefox and Edge I received the frames I sent. Unfortunately, nothing showed up while I was monitoring the stream, and then all the frames were displayed at once just before the “the connection was closed by the stream provider” message. So it seems I cannot get anything until the stream closes. Hopefully this is happening only on my laptop, but I will be able to confirm that after a few more tests. During the afternoon I’m going to connect to other networks to see whether this issue also appears with network architectures different from the one I was connected to during our meeting.

I leave the screenshot of Edge for your reference.

And then a screenshot showing events that just stop with a warning and don't recover.
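
For readers unfamiliar with the synthetic error event in the user report above, the following sketch shows how a client could wrap a failed request into an event of that shape. The field names and values are copied from the report; the function `toSyntheticErrorEvent` and the way the fields are derived are assumptions, not the Console's actual implementation. The `unique_id` in the report matches the event time in epoch milliseconds, which the sketch reproduces.

```typescript
// Shape copied from the reported event; the construction below is a guess.
interface SyntheticErrorEvent {
  time: string;
  name: string;
  isError: true;
  isSynthetic: true;
  unique_id: string;
  data: { error: string };
}

// Wraps any thrown value (for example a rejected fetch) into a synthetic
// event that can be shown in the event list alongside real events.
function toSyntheticErrorEvent(
  error: unknown,
  now: Date = new Date(),
): SyntheticErrorEvent {
  return {
    time: now.toISOString(),
    name: "synthetic.error.unknown",
    isError: true,
    isSynthetic: true,
    unique_id: `synthetic.${now.getTime()}`,
    data: { error: String(error) },
  };
}
```

The point for debugging is that such an event carries only the stringified browser error, so "TypeError: Failed to fetch" says that the request failed at the network level but not why.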

@adriansmares
Contributor

adriansmares commented Jun 28, 2023

Please note that the current event stream is based on HTTP/2 and all of the streams go via the same physical connection: HTTP/2 already provides multiplexing, and everything goes via a single TLS connection. You can test this today by opening more than 6 tabs with different end devices (or even the same one) and observing that you still receive the traffic. The 6 connection limit is not relevant to the issue that we are having right now.

If we move to WebSockets, we cannot use HTTP/2 (there are no WebSockets in HTTP/2), and then we really can have at most 6 tabs open at the same time, because a WebSocket connection is really one physical TLS connection. The tradeoff between WebSockets and HTTP/2 long polling is not as trivial as we are making it look here.

I still believe that we are mishandling the streams in the Console and that this causes the issues. We've been using these event streams for years, but have only recently started to receive reports of missing events or frozen streams. The problem is real, but I don't think that refactoring to WebSockets is the solution we should be rushing towards.
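
One way to check which protocol the event requests actually use in a given browser session is the standard Resource Timing API, whose entries expose `nextHopProtocol` (for example `"h2"` or `"http/1.1"`). The helper below is a sketch for debugging, not part of the Console; `countProtocols` and `ResourceEntry` are hypothetical names. Note that an entry for a long-lived stream may only appear once that response has ended, and the protocol can be an empty string when timing details are not exposed.

```typescript
// Minimal subset of PerformanceResourceTiming that this helper reads.
interface ResourceEntry {
  name: string; // full request URL
  nextHopProtocol: string; // e.g. "h2", "http/1.1", or "" if not exposed
}

// Counts requests per negotiated protocol for URLs containing pathFragment.
function countProtocols(
  entries: ResourceEntry[],
  pathFragment: string,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const entry of entries) {
    if (!entry.name.includes(pathFragment)) {
      continue;
    }
    const protocol = entry.nextHopProtocol || "unknown";
    counts[protocol] = (counts[protocol] ?? 0) + 1;
  }
  return counts;
}

// In a browser console (not executed here):
//   countProtocols(
//     performance.getEntriesByType("resource") as PerformanceResourceTiming[],
//     "/api/v3/events",
//   );
```

If the result shows only `h2` for the event requests, the six-connection limit is unlikely to be the cause in that session; any `http/1.1` entries would suggest a proxy or network path where the limit could apply.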

@kschiffer
Contributor

This might have been resolved via #6387.

@johanstokking can you check whether you can still reproduce this?

@johanstokking
Member Author

Good, let's close this and I'll reopen if I encounter this again.
