Really high CPU load over time #1356

Closed
dotarmin opened this issue Dec 18, 2020 · 28 comments
@dotarmin
Contributor

dotarmin commented Dec 18, 2020

Expected behaviour

Be able to play clips, both long and short without having to worry about the CPU load.

Current behaviour

When playing shorter clips on v2.3.0 LTS (and also on v2.2.0), the CPU load climbs to 90-92% over time and stays there. I have attached some screenshots to show what it looks like. We do not see this behaviour with longer clips.

Shorter clips = around 20 seconds
Longer clips = hours

I think it has to do with the number of commands sent rather than the actual file length, but that's just a theory.

  • v2.3.0 LTS does not crash when this happens
  • v2.2.0 does crash when this happens
  • v2.0.7 works

Used commands (from automation system)

LOAD
PLAY

LOAD
PLAY

Environment

  • Server version: v2.3.0 LTS
  • Operating system: Windows 7 x64
  • 8 decklink channels (fill only) configured but only 2 actively used

Screenshots

image01

image02

image03

image04

@TondaKrist

We are experiencing this too. After some time the CasparCG 2.3 LTS process gets stuck at 99% CPU and then fails, even after STOPping all layers and then playing only one.

@ronag
Member

ronag commented Jan 8, 2021

I have seen this too.

@ronag
Member

ronag commented Jan 8, 2021

Does anyone have reliable repro steps?

@Julusian
Member

Julusian commented Jan 8, 2021

@scriptorian is able to reproduce this and is having a look into the cause

@hummelstrand
Member

Seems like it can be reproduced by issuing multiple LOAD and PLAY commands over time.

@TondaKrist

Reproducible after multiple PLAY and LOADBG commands over time, as @hummelstrand mentioned, even on a single layer. I will prepare a command log to reproduce it.

@scriptorian
Contributor

As mentioned I have managed to reproduce this with a test script that repeatedly LOADs a clip onto a channel/layer (using the ffmpeg producer). No PLAY is required to provoke the fault. For testing I have made the script loop every 200ms and this makes the problem apparent in a reasonable amount of time. The first symptom is the process working set increasing linearly, then after a few minutes the CPU load starts increasing too.
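The test script itself is not included in this thread. A minimal sketch of such a loop might look like the following, assuming the server accepts AMCP over TCP on its default port 5250 and that a clip named `AMB` exists in the media folder. The function names, the channel/layer values, and the clip name are illustrative placeholders, not taken from the original script.

```python
import socket
import time

AMCP_PORT = 5250  # assumed: CasparCG's default AMCP port; adjust to your config


def build_load_command(channel: int, layer: int, clip: str) -> bytes:
    """Return a single AMCP LOAD command, terminated with CRLF."""
    return f'LOAD {channel}-{layer} "{clip}"\r\n'.encode("utf-8")


def hammer_load(host, clip, channel=1, layer=10, interval_s=0.2, iterations=1000):
    """Send LOAD for the same clip repeatedly, pausing interval_s between sends.

    No PLAY is issued, matching the observation that LOAD alone is enough.
    """
    with socket.create_connection((host, AMCP_PORT), timeout=5) as sock:
        for _ in range(iterations):
            sock.sendall(build_load_command(channel, layer, clip))
            sock.recv(4096)  # read and discard the server's reply
            time.sleep(interval_s)
```

Calling `hammer_load("127.0.0.1", "AMB")` against a test server and watching the process working set and CPU load in parallel should show whether the growth appears. The 200 ms interval only shortens the time to reproduce; it is not a requirement.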

I have analysed the application using various tools and confirmed that it is working well and not leaking any threads or objects on the heap (with the exception of one rare bug that I have addressed - not relevant to this problem) which is great news but frustrating in terms of finding the problem. I recently tried running Windows Performance Analyzer and finally found a clue. By comparing CPU usage early and late in a run it was apparent that an increasing amount of time was spent in the TBB library and with cleaning up thread local storage. With some very simple (and not production ready!) hacking I removed the TBB thread parallel optimisations in the ffmpeg producer and the memory and CPU growth problem disappeared.

I don't believe there is anything wrong with the CasparCG code that uses this library so my next step will be to get an updated version of the TBB library and try again with that. The release notes mention some bugfixes that may be relevant. Intel have now wrapped it into their new oneAPI product and installing that failed for me just now. If anyone here has experience of this library (@ronag?) I'd be grateful for any pointers for how you cooked it / downloaded it last time.

@ronag
Member

ronag commented Jan 20, 2021

Try skipping the custom tbb stuff and use the regular ffmpeg thread pool?

@scriptorian
Contributor

Thanks @ronag. If you are referring to the override of AVFilterGraph::execute that currently uses TBB as the custom multithreading implementation then yes, I have turned this off. The real difference with this problem though is in the tbb::parallel_invoke and tbb::parallel_for_each calls in av_producer and av_util. Removing these stops the problem; removing just one of them halves the rate of growth!

@ronag
Member

ronag commented Jan 20, 2021

For now just remove the tbb stuff. We can follow up with another PR with an updated tbb version later.

@ronag
Member

ronag commented Jan 20, 2021

I don't know how to update tbb at the moment since intel wrapped it into oneAPI.

@ronag
Member

ronag commented Jan 20, 2021

Do we know if this problem occurs on Linux?

@scriptorian
Contributor

Thanks for the suggestions. I've got hold of the latest tbb now and I think the best approach is to push through with trying that. If the problem has gone away then there are no code changes (any tbb interface changes notwithstanding) and linux should continue to work - hopefully without any problems. Any other approach would require a fair amount of code changes with potentially surprising impacts on performance and that seems like something to avoid if possible.

@TondaKrist

Sorry, is it something we can fix via some TBB tweaking in Windows, or not?

@scriptorian
Contributor

I have now downloaded and built with the latest TBB library from the Intel oneAPI product. There were some API changes but dealing with these was straightforward and should be safe.
The good news is that this completely fixed the growing CPU and memory problems. I have left my test script running for a good long time and everything stayed very steady.
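One way to turn "stayed very steady" into a number is to sample the process working set (or CPU load) at intervals and fit a straight line through the samples. The sketch below assumes you already have `(seconds, value)` pairs from whatever monitoring tool you use; the function name and the sample figures are hypothetical and not measurements from this issue.

```python
def growth_per_minute(samples):
    """Least-squares slope of (seconds, value) samples, scaled to per-minute.

    A slope near zero suggests steady usage; a clearly positive slope over a
    long run suggests the linear growth described earlier in the thread.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return 60.0 * num / den


# Hypothetical working-set samples in MB, taken one minute apart.
leaking = [(0, 100.0), (60, 110.0), (120, 120.0)]
steady = [(0, 100.0), (60, 100.0), (120, 100.0)]
```

With the made-up data above, `growth_per_minute(leaking)` reports growth of about 10 MB per minute and `growth_per_minute(steady)` reports none. Real measurements will be noisier, so a longer run gives a more trustworthy slope.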

@TondaKrist

Awesome, will it be included in a future build of CasparCG? Or could you please provide your build for long-term testing?

@scriptorian
Contributor

We are just discussing how to progress with testing this change and whether to make a beta version. Does anyone here have any thoughts? I'll update this thread when we have a plan!

@hummelstrand
Member

Please beta test and report any issues here!
https://github.com/CasparCG/server/releases/tag/v2.3.2-lts-beta

@dimitry-ishenko
Contributor

Is this something to worry about on Linux? (Running NRK version).

@scriptorian
Contributor

It's not clear whether the TBB bug also exists in the Linux version. The TBB release notes include some mentions of fixing relevant bugs in the Windows version so there is reasonable hope that this problem won't affect Linux.
The updated TBB library is available for Linux so it should be straightforward to make an updated build if problems appear.

@hummelstrand
Member

> Is this something to worry about on Linux? (Running NRK version).

The latest NRK version of CasparCG Server is v2.1, so it is not affected by this bug which seems to have been introduced in v2.2.

@dimitry-ishenko
Contributor

OK I get it. Thank you @scriptorian and @hummelstrand

@martastain

Just FYI: It seems there is no problem with increasing CPU load on 2.3.2 beta on Windows 10 (yellow lines). There is just a slight memory usage increase over time but from my experience, it will eventually drop.

Green lines belong to a custom 2.3.0 build running on Debian. Both servers use LOADBG/AUTO to play mixed (Linux) and XDCAM HD (Windows) playlists.

shot-justreadtheinstructions-20210129-111714

@TondaKrist

I can confirm that this build fixes the CPU usage leak on Windows (both Intel and AMD machines, currently running 24/7 for 5 days).
Thanks guys, awesome job on the investigation and fix.

Unfortunately I have experienced a memory leak on the GPU when HTML template GPU acceleration is enabled. I will start a new thread for that.

@ronag
Member

ronag commented Jan 29, 2021

> Unfortunately I have experienced a memory leak on the GPU when HTML template GPU acceleration is enabled.

I have also encountered this.

@dotarmin
Contributor Author

dotarmin commented Jan 29, 2021

@TondaKrist or @ronag, can you please create an issue for this if not already done? Thanks

Never mind, already done, thanks!

@sendust

sendust commented Feb 1, 2021

This is off-topic, but Beta-version v2.3.2-lts-beta also has audio issues on systems that use the 1001-based-standard.
#1326 already has a solution to the audio issue, and I hope users using NTSC can participate in this test.
Thanks~~

@Julusian Julusian closed this as completed Mar 6, 2023