Server breaks on OVT connection loss #355
I'll check this out. Apart from this, I have one question about your system configuration. As you can see, OME's edge mode is mainly useful for WebRTC output, and not very useful for HTTP-based streaming. If your service only supports HLS streaming, why not use the following configuration?

OBS -> RTMP -> OME Origin -> [HTTP reverse proxy] <--> NGINX

This configuration would also solve the "keep alive" problem you mentioned. Also, OME's edge repackages the stream received over OVT into HLS, so every edge server creates slightly different chunks. Is there any special reason to use OME as an edge for HLS streaming?
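For illustration, the reverse-proxy setup suggested above could look roughly like the sketch below, which writes a minimal NGINX config for pulling HLS from an OME origin. The hostname `origin.example.com` and port `8080` are placeholders, not real values from this issue (check the HLS publisher port in your Server.xml), and this is an assumed setup, not an official OME configuration:

```shell
#!/bin/sh
# Sketch: generate a minimal NGINX reverse-proxy config in front of an
# OME origin's HLS output. Hostname and port are hypothetical placeholders.
cat > ome-hls-proxy.conf <<'EOF'
upstream ome_origin {
    server origin.example.com:8080;   # assumed OME origin HLS endpoint
    keepalive 16;                     # reuse upstream TCP connections
}

server {
    listen 80;

    location /app/ {
        proxy_pass http://ome_origin;
        proxy_http_version 1.1;       # required for upstream keepalive
        proxy_set_header Connection "";
    }
}
EOF
echo "wrote $(wc -l < ome-hls-proxy.conf) lines"
```

The `keepalive` directive in the upstream block is what provides the connection reuse mentioned above; without `proxy_http_version 1.1` and a cleared `Connection` header, NGINX would close the upstream connection after every request.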
Does "server restart" mean edge or origin?
Please provide a more detailed log so we can find the cause of the problem faster.
Because that architecture probably won't work well with LL-DASH. With HLS it would also make auto-scaling the infrastructure much more difficult, because the OME API does not return usable concurrent-viewer numbers (which makes decent load balancing, where additional servers are added per region depending on current load, harder). There are also still a couple of open issues that make OME not really usable with NGINX in front of it - for example #298. Finally, the suggested architecture might introduce more latency and jitter than pulling the HLS stream directly from OME, which is super efficient / cheap because it needs very little CPU / RAM.
Thankfully only an edge server restart. Otherwise it would kill the stream for all viewers worldwide.
Not tested, but might work with #339 applied if lucky.
Is there any way to get a higher log level than warning without storing any IP addresses in the log? Otherwise I can't do that in production, because of the GDPR, for people who watch from within the EU.
I understand your requirements.
This OVT connection was between a dedicated server (origin) in AS24940 in Germany and virtual machines (edge) in AS14061 in Singapore. So I would guess there was a burst of packet loss or a latency peak caused by buffers or similar. I will try to modify the logging code so that it logs fewer IP addresses, and I should be able to get some more logs this weekend during a smaller charity event. Other edge servers in other locations were not affected, or were affected by the same problem at other times.

Edit: I am wondering whether, in the long term, using SRT between edge and origin might be favorable as the default, because that protocol seems to be designed to be very resilient and might be less work to maintain than OVT - depending on what you as core developers prefer.
I haven't been able to reproduce this in the end, but I tried to solve the problem by creating a similar situation. |
With the current master, the OVT connection loss still seems to cause issues from which it does not automatically recover in a short time. Usually you will then see something like this in the log:
Thanks for testing. (Did you update both Origin and Edge to the latest version?)
What I'm curious about is whether the connection between Edge and Origin is really broken in (2). In the current logic, if the TCP connection is physically disconnected, the OVT Provider immediately detects it and deletes the internal stream. I've tested this a number of times by quitting the Origin abruptly. If you provide a detailed log of this, it will be very helpful for my analysis. Alternatively, can you verify with the netstat command that there is no connection when the link between the servers is lost? If the TCP connection remains, the OVT Provider waits for input without deleting the stream.
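The netstat check suggested above could look like the following sketch. Port 9000 is OVT's default in the sample Server.xml; adjust it if your configuration differs:

```shell
#!/bin/sh
# Check whether the Edge still holds a TCP connection to the Origin's
# OVT port (9000 assumed here as the default; verify in Server.xml).
OVT_PORT=9000

# ss is the modern replacement for netstat; fall back if it is absent.
if command -v ss >/dev/null 2>&1; then
    ss -tan | grep ":${OVT_PORT}" || echo "no TCP connection on port ${OVT_PORT}"
else
    netstat -tan 2>/dev/null | grep ":${OVT_PORT}" || echo "no TCP connection on port ${OVT_PORT}"
fi
```

If this prints an ESTABLISHED line even though the Origin is unreachable, that matches the "TCP connection remains, stream is not deleted" case described above.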
Oh, I realized there was one exception. If the connection between Origin and Edge suddenly breaks, the edge relies on TCP keepalive to notice this. Would you please check whether TCP keepalive is active and whether the timeout is short? If this is the cause, would it be correct to delete the stream when there is no input for x seconds? What do you think?
The origin and the edge both run inside Docker containers built by the OME Dockerfile, on default Ubuntu 20.04. So TCP keepalive probes should only start after the connection has been idle for two hours (according to the man page, tcp_keepalive_time defaults to 7200 seconds), and are then repeated every 75 seconds. So this should never have any influence on OVT. Nowadays TCP keepalive is mostly needed so that stateful connection tracking or firewalls don't drop the connection.
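As a quick sanity check of the values mentioned above, the kernel's keepalive parameters can be read from /proc (the comments state the stock Linux defaults; a container inherits the host kernel's values):

```shell
#!/bin/sh
# Print the kernel's TCP keepalive parameters. Stock Linux defaults:
# 7200 s idle before the first probe, one probe every 75 s, and the
# connection is declared dead after 9 unanswered probes.
for p in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
    printf '%s = %s\n' "$p" "$(cat /proc/sys/net/ipv4/$p)"
done
```

With the defaults, a silently dropped connection would only be detected after roughly 7200 + 9 x 75 seconds, i.e. more than two hours.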
If the TCP connection between Origin and Edge is not terminated normally, the edge cannot know that the connection is broken until the TCP keepalive expires. So the edge's stream will only be deleted after a very long time. If this is correct, then I will try to develop an OVT keepalive function.
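Until an OVT-level keepalive exists, one interim mitigation (my assumption, not an official OME recommendation, and only effective if OME actually sets SO_KEEPALIVE on the OVT socket, as the discussion above suggests) would be to shorten the kernel keepalive timers on the Edge host so a silently dropped connection is detected within about a minute instead of roughly two hours. This requires root; for a Docker container, the same values can be passed with --sysctl:

```shell
#!/bin/sh
# Sketch: aggressive TCP keepalive so a dead Origin-Edge link is
# noticed in ~60 s (30 s idle + 3 probes x 10 s). The values are
# illustrative, not tuned recommendations.
sysctl -w net.ipv4.tcp_keepalive_time=30    # first probe after 30 s idle
sysctl -w net.ipv4.tcp_keepalive_intvl=10   # then one probe every 10 s
sysctl -w net.ipv4.tcp_keepalive_probes=3   # declare dead after 3 misses
```

Note that these are system-wide settings, so they affect every TCP connection on the host, not just OVT.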
Closing this, as connection loss seems to be handled well enough for production with the current master branch.
Describe the bug
For some reason, a few of my overseas servers occasionally seem to die for single streams and do not recover for those streams without a restart, while the API stays up and other streams keep working.
The setup is OBS -> rtmp -> OME origin -> OVT -> OME edge -> HLS
To Reproduce
Steps to reproduce the behavior:
No idea how to reproduce this yet.
Expected behavior
Stream on origin should not get into a "dead" state from which it doesn't recover without a server restart, even if the connection between the origin and edge server dies.
Logs
(for the line numbers, see basisbit@4baa88c )
After this, the connection does not recover, and the stream key will not be usable until the next restart, because clients will keep trying to fetch the playlist file.
Server:
Player (please complete the following information):