Is there a flag to use GPU acceleration? #12
Update (July 2022): The latest release of DVR-Scan also includes an experimental build with CUDA support; you can grab it from https://dvr-scan.readthedocs.io/en/latest/download/ and can enable GPU mode with `-b MOG2_CUDA`.
See the docs for details. Thank you for your submission @laurentopia
Thanks,
Hi @laurentopia; recently, a pull request was merged which added support for this. Thank you!
GPU acceleration should now be possible with the OpenCV Python bindings. It may complicate building & distributing DVR-Scan a bit, but that may be a good solution for now: it keeps DVR-Scan written in Python while still providing significant performance improvements. I don't have a timeline for this yet, as I'm still having trouble compiling the module with CUDA support in a way suitable for distribution, but I found a useful resource on the topic. Edit: Also found a blog post which details the steps for compiling it on Windows. I need to understand what implications this will have for binary redistribution, but it does seem feasible. I plan on starting with having DVR-Scan support GPU acceleration from the Python side, and once that's confirmed working, will then move on to releasing a Windows binary. I still need to come up with a more optimal workflow for releasing Windows binaries, as the process is a bit unwieldy right now.
Thank you, I'll keep that in mind the next time I need to sift through hours of footage.
Hey @laurentopia; I'd like to keep this issue open, as I do want to investigate adding CUDA support. Also note that the next release of DVR-Scan will include a faster algorithm, as per #48. Thanks!
Hi @Breakthrough, I hope this is the correct place for this missive. The TL;DR is that yes, adding OpenCV CUDA support can, along with other important and significant optimisations, massively improve throughput: from 75 FPS to 1300 FPS. One caveat: I completely recreated the fundamentals of DVR-Scan in C# (fully managed, and unsafe where required), as my Python skills are essentially non-existent and I needed something significantly more performant as a matter of urgency. However, some if not all of the optimisations should carry across directly to the Python version. Rather than detail the specifics in one giant wall of text, let me know if/when you're looking at this issue, and I would be more than happy to share the exact details and mechanics of the performance gains. Best regards, CJPN. Edit: gfx is an Nvidia RTX 2080 Super. Some screenies of DVR-Scan and the optimised setup working on the same source material (H.264 @ 1920x1080) with identical settings. The C# version streams multiple files or separate timespans within the same file; four streams, in this instance, is enough to saturate the GPU:
Awesome result - would be glad to learn more about how you went about it. I hope to eventually find something that targets both Nvidia and AMD GPUs, but even something CUDA-based is better than nothing.
Indeed, there are suggestions within OpenCV's docs - this being an imminent target of further research on this end - that utilising UMats as opposed to Mats automagically enables OpenCL processing, with OpenCV choosing the fastest device available and falling back to the CPU as required. While CUDA is fun 'n' all that, its proprietary nature does bring about a certain discomfort, particularly given current hardware shortages.
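For reference, a minimal sketch of what that looks like via OpenCV's Python bindings (the "transparent API"); the file name and default MOG2 parameters here are placeholders, not code from either project:

```python
import cv2

# Wrapping frames in cv2.UMat lets OpenCV dispatch supported operations
# to an OpenCL device when one is available, falling back to the CPU.
cap = cv2.VideoCapture("video.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2()

while True:
    ret, frame = cap.read()
    if not ret:
        break
    u_frame = cv2.UMat(frame)           # device-resident (OpenCL) matrix
    u_gray = cv2.cvtColor(u_frame, cv2.COLOR_BGR2GRAY)
    u_mask = subtractor.apply(u_gray)   # runs via OpenCL where supported
    mask = u_mask.get()                 # download result back to host memory
    # ...motion scoring on `mask` would go here...

cap.release()
```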
Well. OpenCL running on the GPU does indeed automagically work. Performance is, however, nowhere close to the exact same algorithm operating explicitly in CUDA. It appears as if OpenCV is uploading/downloading the UMat to/from the GPU between every processing step - i.e. GPU bus usage is higher with the OpenCL version running at 350 FPS than with the CUDA version running at 1300 FPS - something that one can easily and explicitly prevent with OpenCV/CUDA. Nevertheless, 350 FPS is an improvement over 75 FPS, eh? Further investigation required. Edit: By jiggling around what runs on CPU and GPU, this is up to 550 FPS.
The difference between CUDA and OpenCL isn't as bad as initially thought, and is somewhat counterintuitive:

- 4x 1080p, CUDA: Thread 1 @ frame 10622 (1920x1080) @ 307.44 FPS @ 0.64 Gpx/s
- 4x 4K, CUDA: Thread 1 @ frame 1511 (3840x2560) @ 61.32 FPS @ 0.6 Gpx/s
- 4x 1080p, OpenCL: Thread 1 @ frame 2824 (1920x1080) @ 126.35 FPS @ 0.26 Gpx/s
- 4x 4K, OpenCL: Thread 1 @ frame 1716 (3840x2560) @ 34 FPS @ 0.33 Gpx/s
@CJPNActual is there any source code you can share for your benchmarks? Would love to integrate something like this into DVR-Scan, but I'm not that familiar with OpenCL. Thanks!
This seems extremely promising, both for CUDA and OpenCL. Any chance you can share the source code?
Copied from a recent email: I rewrote the entire concept in C#, with sections of unsafe (C++-style) code for added memory voodoo where using FFmpeg for video decoding is concerned, so I'm not certain it translates directly to Python via a source-share... However, getting the base concept to run on CUDA or OpenCL is as simple as redefining some of the OpenCV textures as "graphics-card resident" and then calling the same functions on them, with the now GPU-resident textures/bitmaps as arguments in place of the "CPU-resident" ones. OpenCV will deal with all of the difficult copy-to-gfx-card stuff internally. The code on your end should look almost identical, with minor changes as mentioned. Does that make sense?
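To make the "graphics-card resident" idea concrete, here is a minimal sketch using OpenCV's Python CUDA bindings (only available in builds compiled with CUDA support; the file name and default MOG2 parameters are placeholders):

```python
import cv2

cap = cv2.VideoCapture("video.mp4")
# CUDA variant of the MOG2 background subtractor.
subtractor = cv2.cuda.createBackgroundSubtractorMOG2()
stream = cv2.cuda.Stream()
gpu_frame = cv2.cuda_GpuMat()

while True:
    ret, frame = cap.read()
    if not ret:
        break
    gpu_frame.upload(frame)  # explicit one-time copy to GPU memory
    # Every subsequent step operates on GPU-resident data; nothing is
    # copied back to the host until we ask for the result.
    gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)
    gpu_mask = subtractor.apply(gpu_gray, -1.0, stream)
    mask = gpu_mask.download()  # single host copy per frame
    # ...motion scoring on `mask` would go here...

cap.release()
```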
Yes, totally. Could you point me to any code samples for that in OpenCV 4? (C++ is fine.) I am considering rewriting DVR-Scan in Rust for the next version, which should support most of the C++-style stuff. Thanks! Edit: It looks like samples/gpu/video_reader.cpp and samples/gpu/bgfg_segm.cpp should be good enough for a first pass, but I'm curious what you mean by redefining as graphics-card resident - do you mean using a GpuMat instead of a regular Mat? Edit 2: And sorry, do you have examples of that for OpenCL? I assumed OpenCV only supported CUDA for certain things.
Is there any way to run dvr-scan with GPU acceleration? Sorry, I am a total n00b and can only start it on Windows or Linux, where it should be possible to get more performance out of the GPU. I would be very interested in helping develop/test dvr-scan, but I totally don't know where to start... Best regards,
@ijutta it's fairly straightforward to modify the code to do this. That being said, the next release of DVR-Scan will include multithreading to improve performance. It will still be slower than using a GPU for the calculations, but should be at least 50% faster than the current version. Once that's done, I'll pick this up for v1.6 and at least try to support versions of OpenCV that are compiled with CUDA support, at minimum as an experimental feature. In the meantime, I'm more than happy to help out or point you in the right direction for adding this to DVR-Scan. I'm working on refactoring the application as a whole to make it easier to integrate GPU/multithreaded support in general (the current application has accumulated a lot of technical debt), but any proofs of concept/PRs are more than welcome.
@ijutta interestingly enough, I just happened to find someone distributing prebuilt packages for Windows. I'll see how difficult it is to mock something up to at least get a decent performance comparison and get back to you. I still need to see how best to support both CPU + GPU scanning in the long run, but will keep this in mind during the v1.5 refactor to make that an explicit goal. Edit: Wasn't able to get it to work due to some missing dependencies, but will give building a custom version a try. If that works, I'll create a branch where people can test this out.
I managed to get this working! There's still plenty of work to be done, but you can download and install an experimental version to test out (make sure to uninstall any existing versions of DVR-Scan):
Once installed or extracted, you can use the new `-b MOG2_CUDA` option to enable GPU-based scanning.
I would recommend comparing performance in scan-only mode (`-so`). If folks could help out by testing this, that would be fantastic (esp. regarding performance).
(((Strong disclaimer! I possess virtually no Python expertise.))) Running the new experimental build immediately complains that:
Apologies. Didn't see the edits until now. While, in my tests, OpenCL performs at ~50% of CUDA, obviously not everybody possesses an Nvidia card.
It's been a while since I tested OpenCL, yet my observation - at least using Emgu.CV, the .NET wrapper - is that rather than explicitly creating an OpenCL pipeline as one would with CUDA, one instead instructs OpenCV as a whole to utilise OpenCL via its "native" interface, thusly:
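(The original C# snippet was not preserved above. For reference, in OpenCV's Python bindings the equivalent global toggle is a one-liner; a minimal sketch:)

```python
import cv2

# Tell OpenCV as a whole to use OpenCL where available; subsequent calls
# on UMat inputs are then dispatched to the OpenCL device transparently.
cv2.ocl.setUseOpenCL(True)
print("OpenCL available:", cv2.ocl.haveOpenCL())
print("OpenCL in use:", cv2.ocl.useOpenCL())
```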
Subsequently, one calls one's pipeline via the native interface, with the exception of the background subtractor, which one calls normally via its member .Apply() function. I believe there may be one caveat in that certain Mats need to be UMats, but otherwise it's relatively easy to port from CPU-only code. Easier than CUDA, one might posit. Disclaimer: The CUDA code on this end isn't hard-coded to 1920x1080 and so on. :)
Other optimisations worth noting:
*To be fair to ffmpeg, the issue, perhaps, pertains to the bizarrely non-conformant video spat out by certain DVRs. Want a consistent framerate? Yeah, but nah. How about I-frames at a relatively useful interval? Again, above the DVR's pay grade.
@CJPNActual are you sure you uninstalled any existing versions of DVR-Scan before installing the experimental version? What do you see if you run `dvr-scan --version`? The main focus for v1.5 will be to implement a multithreaded model similar to what you've suggested: doing frame decoding in a separate thread, and offloading video encoding to ffmpeg in a subprocess (the idea being to have DVR-Scan output a list of cuts and video filters to overlay timestamps/bounding boxes where required). There's much room for performance improvement though, and using Python for tighter ffmpeg/CUDA integration is difficult in the current package ecosystem. I want to investigate possibly rewriting the project in Rust, to allow tighter integration with the various C/C++ libraries and better control over memory management, threading/locking, and integration with the ffmpeg API. I suspect this would also bring performance much closer to the figures you've been able to get. Doing all of this isn't impossible in Python per se, but it definitely feels like making all of those parts work together would be much easier in a statically typed, compiled language.
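A minimal sketch of the offloading idea, assuming a hypothetical `extract_event` helper and standard ffmpeg flags (not DVR-Scan's actual implementation):

```python
import subprocess

def extract_event(src: str, start: float, duration: float, dst: str) -> None:
    """Copy one detected motion event out of `src` without re-encoding.

    Seeking with -ss before -i is fast (keyframe-based), and -c copy
    avoids decoding/encoding entirely, leaving the heavy lifting to ffmpeg.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
         "-i", src, "-c", "copy", dst],
        check=True,
    )

# e.g. extract a 12.5 s event starting at 63.2 s:
extract_event("input.mp4", 63.2, 12.5, "event_001.mp4")
```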
Ah. Exactly what I didn't do. Still reading v1.3.
As mentioned, beware janky video from actual DVRs/NVRs. Asking ffmpeg to seek to a particular timestamp in a "perfectly"-encoded video is trivial; the files I'm getting out of a mid-range HikVision NVR are unpredictable at best, so the only reliable timestamp one has is the frame number. Indeed, I have some code here attempting to convert between frame # and real-world timestamp, but it remains to be perfected. Incidentally, it may be faster to implement sub-frame regions of interest merely by applying a mask multiplication on the GPU, then leaving ffmpeg to extract the region of interest after the fact, should the end-user require only the sub-frame as video. The logic there being that one can define multiple regions of interest without having to expensively extract and process each one individually.
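A sketch of that masking idea with OpenCV's CUDA bindings; the rectangle here is a hypothetical ROI, and `bitwise_and` with a 0/255 mask is used to the same effect as multiplication by a 0/1 mask:

```python
import cv2
import numpy as np

# Build a single-channel mask once: 255 inside the region(s) of interest,
# 0 elsewhere. Multiple regions are just multiple filled rectangles.
mask = np.zeros((1080, 1920), dtype=np.uint8)
cv2.rectangle(mask, (100, 200), (800, 900), 255, thickness=-1)

gpu_mask = cv2.cuda_GpuMat()
gpu_mask.upload(mask)

def mask_frame(gpu_gray: "cv2.cuda_GpuMat") -> "cv2.cuda_GpuMat":
    # Zero out everything outside the ROI on the GPU; downstream motion
    # detection then ignores masked-off pixels, with no per-ROI extraction.
    return cv2.cuda.bitwise_and(gpu_gray, gpu_mask)
```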
Yeah, precisely why I reimplemented it in C# with unsafe enabled. I have significant former experience with real-time video, so it's a safe space. However, it's not every day that an interesting case for processing video at gigabytes per second occurs, so I entertained this as a continuation-of-training exercise on my part.
I've also uploaded an experimental .exe build for 64-bit Windows systems (requires an Nvidia GTX 900-series or above GPU), which you can grab here. Make sure you have CUDA support enabled with your current GPU driver: e.g., when clicking System Information in the Nvidia Control Panel, you should see NVCUDA64.dll under the Components tab. Feedback on both this and the experimental Python version above is most welcome.
Have the multithreaded version ready for testing in the v1.5 branch; just need to clean up a few things before making another release candidate. With the experimental version above, I'm now getting closer to 150 FPS, up from 100 FPS before, when using CUDA mode. In CPU mode, using MOG, I can now get close to real-time full-frame processing (~60 FPS). Edit: This is with a 1080p video on an i7 6700K with an Nvidia RTX 2070. This opens up a lot of doors, but for the time being this is as much optimization as I can commit to for v1.5. Of course, if there are any optimizations folks find or can help out with, I would be more than happy to include those in v1.5. Now that GPU support is in place, once v1.5 is released, I may close this issue and create a new one specifically focused on optimization opportunities (including those brought up in this issue). As always, folks can test it by grabbing one of the last passing v1.5 builds (.whl archives are uploaded for each build under the artifacts).
@Breakthrough Sorry for being a pedant, but given that we're currently and collectively engaged in performance optimisation and/or research, could we speak in gigapixels/second or some such? It removes input resolution and framerate from the equation. Also worth mentioning one's hardware setup, as I'm relatively certain a Raspberry Pi performs differently to... :)
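(For concreteness, the conversion is just pixels per frame times frames per second: taking the 1080p CUDA figure reported above, 1920 × 1080 × 307.44 FPS ≈ 0.64 × 10⁹ px/s = 0.64 Gpx/s, which matches the reported value.)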
v1.5 has officially been released; looking forward to any feedback on the new CUDA builds. I'll close this issue, but feel free to create new issues or discussions if anything crops up. As mentioned previously, there are likely still some areas of improvement regarding performance, so happy to hear any new ideas as well. Will leave this issue pinned for the meantime.
If there are any particular PoCs you want to conduct on the performance side using CUDA, or sections of code you'd like looked at for optimisations, give me a shout. Currently my 750 Ti is only around 5-10% utilised while processing large files using MOG2_CUDA, so I suspect more could be done. Out for two weeks, but will be happy to give it a bash once back.
OpenCV has a build using CUDA, I think, so I was wondering if a GPU acceleration flag existed in DVR-Scan.