Server breaks on OVT connection loss #355

Closed
basisbit opened this issue May 2, 2021 · 13 comments
@basisbit
Contributor

basisbit commented May 2, 2021

Describe the bug
For some reason, individual streams on a few of my overseas servers occasionally die and do not recover without a restart, while the API stays up and other streams keep working.
The setup is OBS -> RTMP -> OME origin -> OVT -> OME edge -> HLS

To Reproduce
Steps to reproduce the behavior:
No idea how to reproduce this yet.

Expected behavior
The stream on origin should not get into a "dead" state from which it doesn't recover without a server restart, even if the connection between the origin and edge server dies.

Logs
(for the line numbers, see basisbit@4baa88c )

[2021-05-01 09:01:09.797] E [SegWorker:17] HLS | hls_stream_server.cpp:58   | Could not find a HLS Publisher playlist for [#default#app123/Streamkey123], playlist.m3u8 : 202
[2021-05-01 09:01:13.940] E [SPSegPub:14] Socket | socket.cpp:1275 | [#292] [0x7f82fe439d10] An error occurred while read data: [errno] Connection timed out (110)
Stack trace: #0   /opt/ovenmediaengine/bin/OvenMediaEngine 0x55d3103ee26c ? + 0x27026c
#1   /opt/ovenmediaengine/bin/OvenMediaEngine 0x55d3103f6ae4 ? + 0x278ae4
#2   /opt/ovenmediaengine/bin/OvenMediaEngine 0x55d3103fbc04 ? + 0x27dc04
#3   /usr/lib/x86_64-linux-gnu/libstdc++.so.6 0x7f83800ce6df ? + 0xbd6df
#4   /lib/x86_64-linux-gnu/libpthread.so.0 0x7f83826196db ? + 0x76db
#5   /lib/x86_64-linux-gnu/libc.so.6     0x7f837f78b71f clone + 0x3f

After this, the connection does not recover and the stream key is not usable until the next restart, because clients keep trying to fetch the playlist file.

Server:

  • OS: Ubuntu 20.04 + current docker.io
  • OvenMediaEngine Version: mostly current master (a fork with a pending PR merged: basisbit@4baa88c)
  • Branch: master

Player (please complete the following information):

  • Device/OS/Browser: various
@basisbit basisbit changed the title from "Server crash on OVT connection loss" to "Server breaks on OVT connection loss" May 2, 2021
@getroot
Sponsor Member

getroot commented May 3, 2021

I'll check this out.

Apart from this, I have one question about your system configuration. As you can see, OME's edge mode is actually useful for WebRTC output, not very useful for HTTP-based streaming. If your service only supports HLS streaming, why not use the following configuration?

OBS -> RTMP -> OME Origin -> [HTTP Reverse proxy] <--> NGINX

This configuration can also provide the "keep alive" you need. Also, OME's edge repackages the stream received over OVT into HLS, so every edge server creates slightly different chunks.

Is there any special reason to use OME as an edge in HLS streaming?

@getroot
Sponsor Member

getroot commented May 3, 2021

Does "server restart" mean edge or origin?
Will the stream recover when the edge server restarts? Or will the stream recover when the origin server is restarted?

@getroot
Sponsor Member

getroot commented May 3, 2021

Please provide a more detailed log to find the cause of the problem faster.

@basisbit
Contributor Author

basisbit commented May 4, 2021

> If your service only supports HLS streaming, why not use the following configuration? ... Nginx

Because that architecture probably won't work well with LL-DASH, and with HLS it would make auto-scaling the infrastructure much more difficult, since the OME API does not return usable concurrent viewer numbers (which complicates load balancing where additional servers are added per region depending on current load). Also, there are still a couple of open issues that make OME not really usable with Nginx in front of it, for example #298. Finally, the suggested architecture might introduce more latency and jitter than pulling the HLS stream directly from OME (which is very efficient / cheap because it needs very little CPU / RAM).

> Does "server restart" mean edge or origin?

Thankfully only edge server restart. Otherwise it would kill the stream for all viewers worldwide.

> Or will the stream recover when the origin server is restarted?

Not tested, but might work with #339 applied if lucky.

> Please provide a more detailed log to find the cause of the problem faster.

Is there any way to get a higher log level than warning without storing any IP addresses in the log? (Otherwise I can't do that in production because of GDPR for people who watch from within the EU.)
Anyway, the problem should be reproducible by starting the stream to the origin server, requesting the playlist file from the edge server and then, before the third chunk file is finished, breaking the OVT connection between the edge and the origin server (see stack trace above).

@getroot
Sponsor Member

getroot commented May 6, 2021

I understand your requirements.
I am trying to reproduce this problem. This issue will be resolved soon.
Do you know why the connection between Origin and Edge is broken? It will help me to reproduce.

@basisbit
Contributor Author

basisbit commented May 6, 2021

This OVT connection was between a dedicated server (origin) in AS24940 in Germany and virtual machines (edge) in AS14061 in Singapore. So I would guess there was a burst of packet loss or a latency peak because of buffers or similar. I will try to modify the logging code so that it logs fewer IP addresses and should be able to get some more logs this weekend during a smaller charity event.

Other edge servers from other locations were not affected or were affected by the same problem at other times.

Edit: I am wondering whether, in the long term, using SRT between edge and origin might be preferable as the default, because that protocol seems to be designed to be very resilient and might be less work to maintain than OVT, depending on what you as core developers prefer.

@getroot
Sponsor Member

getroot commented May 17, 2021

I haven't been able to reproduce this in the end, but I tried to solve the problem by creating a similar situation.
I tested by repeatedly turning off and on the Origin server while the Edge server was connected, and I found and fixed some issues related to sockets.
I hope this solved your problem too.

@basisbit
Contributor Author

With current master, the OVT connection loss still seems to cause issues from which the server does not automatically recover within a short time. When that happens, you will usually see something like this in the log:

edge_1  | [2021-05-24 01:44:48.768] E [StreamCollector:50] Provider | stream_motor.cpp:110  | #default#sample/test(104) Stream could not be deleted to the epoll (err : -1)

@getroot
Sponsor Member

getroot commented May 24, 2021

Thanks for testing. (Did you update both Origin and Edge to the latest version?)

  1. When a player requests playback, Edge looks for the stream internally. If the stream does not exist there, Edge immediately tries to connect to Origin using OVT and creates the stream.

  2. If Edge loses the connection with Origin, it deletes the stream. After that, if the player makes a request again, step (1) is repeated.

What I'm curious about is whether the connection between Edge and Origin is really broken in (2). In the current logic, if the TCP connection is physically disconnected, the OVT Provider immediately detects it and deletes the internal stream. I've tested this a number of times by quitting Origin abruptly.

If you provide a detailed log of this, it will be more helpful for me to analyze it. Or can you verify that there is no connection with netstat command when the connection between servers is lost? If the TCP connection remains, the OVT Provider waits for input without deleting the stream.
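
For readers following along, steps (1) and (2) above can be modelled with a minimal sketch. This is not OME code: the class `EdgeStreamTable`, its methods, and the `pull_from_origin` callback are hypothetical names used only to illustrate the described behaviour, where a stream is pulled on the first player request and dropped when the origin connection is reported lost.

```python
from typing import Callable, Dict, List, Optional


class EdgeStreamTable:
    """Toy model of the edge behaviour in steps (1) and (2); not OME code."""

    def __init__(self, pull_from_origin: Callable[[str], Optional[dict]]):
        # pull_from_origin is a hypothetical stand-in for the OVT pull.
        self._pull = pull_from_origin
        self._streams: Dict[str, dict] = {}

    def on_player_request(self, name: str) -> Optional[dict]:
        # Step 1: reuse the local stream, otherwise pull it from the origin.
        stream = self._streams.get(name)
        if stream is None:
            stream = self._pull(name)
            if stream is not None:
                self._streams[name] = stream
        return stream

    def on_origin_disconnected(self, name: str) -> None:
        # Step 2: drop the stream so the next request triggers a new pull.
        self._streams.pop(name, None)

    def has_stream(self, name: str) -> bool:
        return name in self._streams


# Usage: record every pull so the number of origin connections is visible.
pulls: List[str] = []


def fake_pull(name: str) -> dict:
    pulls.append(name)
    return {"name": name}


table = EdgeStreamTable(fake_pull)
table.on_player_request("app/stream")       # first request pulls from origin
table.on_player_request("app/stream")       # served locally, no second pull
table.on_origin_disconnected("app/stream")  # connection loss detected
table.on_player_request("app/stream")       # pulls again
```

The sketch also shows the failure mode discussed in this thread: if `on_origin_disconnected` is never called because the loss goes undetected, the stale entry stays in the table and no new pull is attempted.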

@getroot
Sponsor Member

getroot commented May 24, 2021

Oh, I realized there was one exception.

If the connection between Origin and Edge suddenly breaks, the edge relies on TCP keepalive to notice this. Would you please check whether TCP keepalive is active and whether its timeout is short?

If this is the cause, would it be right to delete the stream when there is no input for x seconds? What do you think?
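
The "no input for x seconds" idea can be sketched as a small watchdog. This is only an illustration under stated assumptions, not the OME implementation: `InputIdleWatchdog`, `on_input`, and `expire` are hypothetical names, and the clock is injectable so the behaviour can be shown without real waiting.

```python
import time
from typing import Callable, Dict, List


class InputIdleWatchdog:
    """Tracks the last input time per stream and reports idle streams."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_input: Dict[str, float] = {}

    def on_input(self, stream: str) -> None:
        # Call whenever a packet for this stream arrives from the origin.
        self._last_input[stream] = self._clock()

    def expire(self) -> List[str]:
        # Return and forget every stream with no input for timeout_s seconds.
        now = self._clock()
        stale = [name for name, seen in self._last_input.items()
                 if now - seen >= self._timeout_s]
        for name in stale:
            del self._last_input[name]
        return stale


# Usage with a fake clock: x = 3 seconds of silence deletes the stream.
fake_now = [0.0]
dog = InputIdleWatchdog(timeout_s=3.0, clock=lambda: fake_now[0])
dog.on_input("app/stream")
fake_now[0] = 2.0
still_alive = dog.expire()  # 2 s of silence, nothing expires yet
fake_now[0] = 5.5
expired = dog.expire()      # 5.5 s of silence, the stream is reported
```

A caller would delete each stream returned by `expire()`, so that the next player request triggers a fresh pull from the origin.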

@basisbit
Contributor Author

basisbit commented May 24, 2021

The origin and the edge both run inside Docker containers built from the OME Dockerfile, on a default Ubuntu 20.04 host. Thus the TCP keepalive heartbeat transmissions should only happen every 75 seconds (according to man page) and only if there was no other traffic on this connection for 75 seconds previously. So, this should never have any influence on OVT. Nowadays TCP keepalive is mostly needed so that stateful connection tracking or firewalls don't drop the connection.

@getroot
Sponsor Member

getroot commented May 25, 2021

If the TCP connection between Origin and Edge is not terminated normally, the edge cannot know that the connection has been lost until the TCP keepalive expires. So the stream on the edge is only deleted after a very long time. If this is correct, I will try to develop an OVT keepalive function.
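
To put a number on "a very long time", here is a rough sketch of the arithmetic for an idle connection, together with per-socket tuning. It assumes Linux and uses the defaults listed in tcp(7) (`tcp_keepalive_time` 7200 s, `tcp_keepalive_intvl` 75 s, `tcp_keepalive_probes` 9); a given host may be configured differently. The helper names are hypothetical, and whether OME sets these options on its OVT sockets is not stated in this thread.

```python
import socket


def keepalive_detection_seconds(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate time for TCP keepalive to give up on a silent, idle peer."""
    # Probing starts after idle_s; then `probes` unanswered probes are sent
    # interval_s apart before the connection is reported dead.
    return idle_s + interval_s * probes


def enable_short_keepalive(sock: socket.socket, idle_s: int = 10,
                           interval_s: int = 5, probes: int = 3) -> None:
    """Enable keepalive with short timers on one socket (Linux option names)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


# With the tcp(7) defaults, and only if SO_KEEPALIVE is enabled at all:
linux_default = keepalive_detection_seconds(7200, 75, 9)  # 7875 s, about 2 h 11 min
# With the example values used by enable_short_keepalive:
tuned = keepalive_detection_seconds(10, 5, 3)             # 25 s
```

An application-level keepalive, as proposed above, avoids depending on these kernel settings altogether.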

@basisbit
Contributor Author

Closing this, as connection loss seems to be handled well enough for production with the current master branch.
