Server breaks on OVT connection loss #355

basisbit · 2021-05-02T01:13:39Z

Describe the bug
For some reason, a few of my overseas servers occasionally die for single streams and do not recover for those streams without a restart, while the API stays up and other streams keep working.
The setup is OBS -> RTMP -> OME origin -> OVT -> OME edge -> HLS

To Reproduce
Steps to reproduce the behavior:
No idea how to reproduce this yet.

Expected behavior
The stream on the origin should not get into a "dead" state from which it cannot recover without a server restart, even if the connection between the origin and edge servers dies.

Logs
(for the line numbers, see basisbit@4baa88c )

[2021-05-01 09:01:09.797] E [SegWorker:17] HLS | hls_stream_server.cpp:58   | Could not find a HLS Publisher playlist for [#default#app123/Streamkey123], playlist.m3u8 : 202
[2021-05-01 09:01:13.940] E [SPSegPub:14] Socket | socket.cpp:1275 | [#292] [0x7f82fe439d10] An error occurred while read data: [errno] Connection timed out (110)
Stack trace: #0   /opt/ovenmediaengine/bin/OvenMediaEngine 0x55d3103ee26c ? + 0x27026c
#1   /opt/ovenmediaengine/bin/OvenMediaEngine 0x55d3103f6ae4 ? + 0x278ae4
#2   /opt/ovenmediaengine/bin/OvenMediaEngine 0x55d3103fbc04 ? + 0x27dc04
#3   /usr/lib/x86_64-linux-gnu/libstdc++.so.6 0x7f83800ce6df ? + 0xbd6df
#4   /lib/x86_64-linux-gnu/libpthread.so.0 0x7f83826196db ? + 0x76db
#5   /lib/x86_64-linux-gnu/libc.so.6     0x7f837f78b71f clone + 0x3f

After this, the connection does not recover, and the stream key is unusable until the next restart because clients will keep trying to fetch the playlist file.

Server:

OS: Ubuntu 20.04 + current docker.io
OvenMediaEngine Version: mostly current master (a fork with a waiting PR merged: basisbit@4baa88c )
Branch: master

Player (please complete the following information):

Device/OS/Browser: various


getroot · 2021-05-03T01:56:27Z

I'll check this out.

Apart from this, I have one question about your system configuration. OME's edge mode is really useful for WebRTC output, but not very useful for HTTP-based streaming. If your service only supports HLS streaming, why not use the following configuration?

OBS -> RTMP -> OME Origin -> [HTTP Reverse proxy] <--> NGINX

This configuration can also provide the "keep alive" you need. Also, OME's edge repackages the stream received over OVT into HLS, so every edge server creates slightly different chunks.

Is there any special reason to use OME as an edge in HLS streaming?

getroot · 2021-05-03T02:58:35Z

Does "server restart" mean edge or origin?
Will the stream recover when the edge server restarts? Or will the stream recover when the origin server is restarted?

getroot · 2021-05-03T03:05:03Z

Please provide a more detailed log to find the cause of the problem faster.

basisbit · 2021-05-04T11:38:49Z

> If your service only supports HLS streaming, why not use the following configuration? ... Nginx

Because that architecture probably won't work well with LL-DASH, and with HLS it would make it much more difficult to auto-scale the infrastructure, because the OME API does not return usable concurrent viewer numbers (which makes decent load balancing harder, where additional servers are added per region depending on current load). Also, there are still a couple of open issues which make OME not really usable with Nginx in front of it, for example #298. Finally, the suggested architecture might introduce more latency and jitter than just pulling the HLS stream directly from OME (which is super efficient / cheap because it needs very little CPU / RAM).

> Does "server restart" mean edge or origin?

Thankfully only edge server restart. Otherwise it would kill the stream for all viewers worldwide.

> Or will the stream recover when the origin server is restarted?

Not tested, but might work with #339 applied if lucky.

> Please provide a more detailed log to find the cause of the problem faster.

Is there any way to get a higher log level than warning without storing any IP addresses in the log? (Otherwise I can't do that in production because of GDPR for people who watch from within the EU.)
Anyway, the problem should be reproducible by starting the stream to the origin server, requesting the playlist file from the edge server and then, before the third chunk file is finished, breaking the OVT connection between the edge and the origin server (see stack trace above).
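
On the logging question above: one possible way to share more detailed logs without exposing viewer addresses is to scrub them before the log leaves the server. The following is only a minimal sketch of such a post-processing step in Python; it is not an OME logging option. The function name `redact_ips` and the `[ip]` placeholder are hypothetical, and the pattern only covers IPv4 (a loose IPv6 pattern would also match timestamps such as `09:01:09`, so IPv6 is deliberately left out here).

```python
import re

# IPv4 dotted quad with an optional ":port" suffix.
# Assumption: over-matching (e.g. a four-part version string) is acceptable,
# because hiding too much is safer than leaking an address.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d{1,5})?\b")


def redact_ips(line: str, placeholder: str = "[ip]") -> str:
    """Return the log line with every IPv4 address (and port) replaced."""
    return IPV4_RE.sub(placeholder, line)


# Example with a made-up log line (203.0.113.0/24 is a documentation range).
print(redact_ips("client 203.0.113.7:51234 connected"))
```

Lines that contain no address, such as the socket error quoted in the original report, pass through unchanged.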

getroot · 2021-05-06T11:19:45Z

I understand your requirements.
I am trying to reproduce this problem. This issue will be resolved soon.
Do you know why the connection between Origin and Edge is broken? It will help me to reproduce.

basisbit · 2021-05-06T13:00:09Z

This OVT connection was between a dedicated server (origin) in AS24940 in Germany and virtual machines (edge) in AS14061 in Singapore. So I would guess there was a burst of packet loss or a latency peak because of buffers or similar. I will try to modify the logging code so that it logs fewer IP addresses, and I should be able to get some more logs this weekend at a smaller charity event.

Other edge servers from other locations were not affected or were affected by the same problem at other times.

Edit: I am wondering whether, in the long term, using SRT between edge and origin might be preferable as the default, because that protocol seems to be designed to be very resilient and might be less work to maintain than OVT, depending on what you as core developers prefer.

getroot · 2021-05-17T04:50:50Z

I haven't been able to reproduce this in the end, but I tried to solve the problem by creating a similar situation.
I tested by repeatedly turning off and on the Origin server while the Edge server was connected, and I found and fixed some issues related to sockets.
I hope this solved your problem too.

basisbit · 2021-05-24T01:48:33Z

With current master, OVT connection loss still seems to cause issues from which the server does not recover automatically within a short time. Usually you will then see something like this in the log:

edge_1  | [2021-05-24 01:44:48.768] E [StreamCollector:50] Provider | stream_motor.cpp:110  | #default#sample/test(104) Stream could not be deleted to the epoll (err : -1)

getroot · 2021-05-24T07:21:08Z

Thanks for testing. (Did you update both Origin and Edge to the latest version?)

1. Edge looks for a stream internally when a player requests playback. If there is no such stream, it immediately tries to connect to Origin using OVT and creates the stream.
2. If Edge loses its connection with Origin, it deletes the stream. After that, if the player makes a request again, (1) is repeated.
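
The two steps above can be modelled with a toy sketch. This is not OME code: the class `EdgeSketch`, its method names and the `connect_to_origin` callback are hypothetical and only illustrate the pull-on-demand logic as described in this comment.

```python
class EdgeSketch:
    """Toy model of the described edge behaviour (not the OME implementation)."""

    def __init__(self, connect_to_origin):
        # connect_to_origin(stream_name) returns some session object.
        self._connect = connect_to_origin
        self.streams = {}

    def on_play_request(self, name):
        # Step 1: use the local stream if present, otherwise pull it from Origin.
        if name not in self.streams:
            self.streams[name] = self._connect(name)
        return self.streams[name]

    def on_origin_disconnected(self, name):
        # Step 2: drop the stream once the connection loss has been detected.
        self.streams.pop(name, None)


pulls = []

def fake_connect(name):
    # Stand-in for an OVT pull; records every connection attempt.
    pulls.append(name)
    return f"ovt-session-{len(pulls)}"

edge = EdgeSketch(fake_connect)
edge.on_play_request("app/stream")         # first request pulls from Origin
edge.on_play_request("app/stream")         # second request reuses the stream
edge.on_origin_disconnected("app/stream")  # loss detected, stream deleted
edge.on_play_request("app/stream")         # next request pulls again
```

In this model, recovery depends entirely on `on_origin_disconnected` being called: if the loss is never detected, the stale entry stays in `streams` and a new player request never triggers a reconnect.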

What I'm curious about is whether the connection between Edge and Origin is really broken in (2). In the current logic, if the tcp connection is physically disconnected, the OVT Provider immediately detects it and deletes the internal stream. I've tested this a number of times by quitting Origin abruptly.

If you provide a detailed log of this, it will help me analyze it. Or can you verify with the netstat command that no connection remains when the connection between the servers is lost? If the TCP connection remains, the OVT Provider waits for input without deleting the stream.

getroot · 2021-05-24T07:47:05Z

Oh, I realized there was one exception.

If the connection between Origin and Edge suddenly breaks, the edge relies on TCP keepalive to notice this. Would you please check whether TCP keepalive is active and whether its timeout is short?

If this is the cause, would it be correct to delete the stream when there is no input for x seconds? What do you think?
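
The "no input for x seconds" idea can be sketched as a small watchdog. This is only an illustration in Python under stated assumptions, not how OME implements it: the class name `InputWatchdog`, its methods and the injectable `clock` parameter are all hypothetical.

```python
import time


class InputWatchdog:
    """Tracks when data last arrived and reports when the stream has gone quiet."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the logic can be tested
        self._last_input = clock()

    def on_input(self):
        # Call this whenever a packet arrives from Origin.
        self._last_input = self._clock()

    def expired(self):
        # True once no input has been seen for at least timeout_s seconds.
        return self._clock() - self._last_input >= self.timeout_s


# Usage sketch: the receive loop would call on_input() per packet, and a
# periodic check would delete the stream when expired() returns True.
watchdog = InputWatchdog(timeout_s=3)
watchdog.on_input()
print(watchdog.expired())
```

The trade-off is the choice of x: a short timeout recovers quickly but may drop streams during a brief stall on a long-distance link, while a long timeout leaves players on a dead stream for longer.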

basisbit · 2021-05-24T10:44:41Z

The origin and the edge both run inside Docker containers built from the OME Dockerfile, running on a default Ubuntu 20.04 host. Thus the TCP keepalive heartbeat transmissions should only happen every 75 seconds (according to the man page), and only if there was no other traffic on this connection for the previous 75 seconds. So this should never have any influence on OVT. Nowadays TCP keepalive is mostly needed so that stateful connection tracking or firewalls don't drop the connection.

getroot · 2021-05-25T02:48:27Z

If the TCP connection between Origin and Edge is not terminated normally, the edge cannot know that the connection is gone until the TCP keepalive expires, so the edge's stream is only deleted after a very long time. If this is correct, I will try to develop an OVT keepalive function.
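
To put a number on "a very long time", here is a rough Python sketch. It assumes the Linux defaults documented in tcp(7): `tcp_keepalive_time` 7200 s, `tcp_keepalive_intvl` 75 s and `tcp_keepalive_probes` 9. The values actually in effect on a given host can be read from `/proc/sys/net/ipv4/tcp_keepalive_*`, and this thread does not say whether OME sets its own values on the OVT socket. The helper names and the shorter example values are hypothetical, and the `TCP_KEEP*` socket options shown are Linux-specific.

```python
import socket

# Linux defaults as documented in tcp(7); a host may override them via sysctl.
LINUX_DEFAULTS = dict(idle_s=7200, interval_s=75, probes=9)


def keepalive_detection_time(idle_s, interval_s, probes):
    """Worst-case seconds until keepalive declares an idle, dead peer gone."""
    return idle_s + interval_s * probes


def enable_fast_keepalive(sock, idle_s=10, interval_s=5, probes=3):
    """Enable keepalive with shorter per-socket timers (example values only)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


print(keepalive_detection_time(**LINUX_DEFAULTS))  # 7200 + 75 * 9 = 7875
print(keepalive_detection_time(10, 5, 3))          # 10 + 5 * 3 = 25
```

With those defaults, a silently dead connection would take about 2 hours 11 minutes to be noticed through keepalive alone, which is why an application-level OVT keepalive or shorter per-socket timers would be needed for faster recovery.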

basisbit · 2021-05-31T11:56:00Z

Closing this, as connection loss seems to be handled well enough for production with the current master branch.

basisbit changed the title ~~Server crash on OVT connection loss~~ Server breaks on OVT connection loss May 2, 2021

basisbit closed this as completed May 31, 2021
